Nvidia Coding Interview Questions : OA, Technical Rounds & Onsite

Nvidia is one of the most sought-after tech companies for engineers interested in GPU computing, AI infrastructure, and high-performance systems. Nvidia hires for SDE, GPU Software Engineer, Systems Software Engineer, Compiler Engineer, and ML Infrastructure roles. The interview process is rigorous, with a strong focus on C++, data structures, algorithms, and systems-level thinking.


Hiring Process

Nvidia's interview process typically has 5–6 stages:

Step 1 — Online Application Apply via Nvidia Careers. Strong GPA, relevant internships, and projects in GPU computing, systems programming, or ML infrastructure stand out.

Step 2 — Online Assessment (OA) Hosted on HackerRank or CodeSignal. 2–3 coding problems in 90 minutes. Focus on DSA — arrays, strings, DP, graphs. Some roles include C++ MCQs on memory management and templates.

Step 3 — Phone Screen 45–60 minute technical interview with an engineer. 1–2 medium/hard coding problems. Expect follow-ups on time and space complexity, edge cases, and code correctness.

Step 4 — Technical Onsite Round 1 (Coding) 60–75 minutes. 1–2 coding problems, often involving graphs, trees, or dynamic programming. Interviewer evaluates problem-solving approach and clean code.

Step 5 — Technical Onsite Round 2 (Systems / Architecture) Deep dive into systems design, memory management, parallel computing concepts, or compiler design depending on the role. GPU-specific roles may include CUDA or parallel algorithm questions.

Step 6 — Hiring Manager + HR Round Behavioral questions, team fit, and offer negotiation. Nvidia values ownership, technical depth, and curiosity about GPU/AI tech.

Tips: Nvidia interviewers pay close attention to how you handle edge cases and optimize solutions. Practice writing clean, production-quality C++ code. For GPU/systems roles, understand cache coherency, memory hierarchies, and parallel execution models.


Preparation Resources

Part 1 — Online Assessment (OA) Questions

1. Trapping Rain Water Hard

OATwo Pointers / StackLeetCode: #42

Given n non-negative integers representing an elevation map where the width of each bar is 1, compute how much water it can trap after raining.

Input: height = [0,1,0,2,1,0,1,3,2,1,2,1]
Output: 6

Approach: Use two pointers from both ends. Track leftMax and rightMax. At each step, process the side with the smaller max — water at that position is max - height[i]. Complexity: Time O(n) · Space O(1) Follow-up: Solve using a monotonic stack approach.


2. Minimum Window Substring Hard

OASliding WindowLeetCode: #76

Given strings s and t, return the minimum window substring of s that contains all characters of t. Return empty string if no such window exists.

Input: s = "ADOBECODEBANC", t = "ABC"
Output: "BANC"

Approach: Sliding window with two frequency maps. Expand right pointer until all chars of t are covered, then shrink left pointer to minimize window. Track best window seen. Complexity: Time O(|s| + |t|) · Space O(|s| + |t|) Follow-up: What if t has duplicate characters?


3. Word Break Medium

OADynamic ProgrammingLeetCode: #139

Given a string s and a dictionary of strings wordDict, return true if s can be segmented into a space-separated sequence of one or more dictionary words.

Input: s = "leetcode", wordDict = ["leet","code"]
Output: true

Approach: DP array dp[i] = true if s[0..i-1] can be segmented. For each position i, check all substrings ending at i — if dp[j] is true and s[j..i] is in the dictionary, set dp[i] = true. Complexity: Time O(n^2) · Space O(n) Follow-up: Return all possible segmentations (Word Break II).


4. Merge Intervals Medium

OAArrays / SortingLeetCode: #56

Given an array of intervals, merge all overlapping intervals and return an array of the non-overlapping intervals.

Input: intervals = [[1,3],[2,6],[8,10],[15,18]]
Output: [[1,6],[8,10],[15,18]]

Approach: Sort intervals by start time. Iterate and check if current interval overlaps with the last merged interval. If yes, extend end. Otherwise, add new interval. Complexity: Time O(n log n) · Space O(n) Follow-up: Insert a new interval into a sorted list of non-overlapping intervals.


5. Maximum Subarray Medium

OADynamic Programming / KadaneLeetCode: #53

Given an integer array nums, find the subarray with the largest sum, and return its sum.

Input: nums = [-2,1,-3,4,-1,2,1,-5,4]
Output: 6  (subarray [4,-1,2,1])

Approach: Kadane's algorithm. Track currentSum = max(nums[i], currentSum + nums[i]) and update maxSum at each step. Complexity: Time O(n) · Space O(1) Follow-up: Return the actual subarray indices, not just the sum.


6. Number of Islands Medium

OABFS / DFS / Union-FindLeetCode: #200

Given an m x n 2D binary grid of '1's (land) and '0's (water), return the number of islands.

Input: grid = [["1","1","0"],["1","1","0"],["0","0","1"]]
Output: 2

Approach: DFS/BFS from each unvisited '1'. Mark all connected land cells as visited. Count how many DFS/BFS calls you make — that is the number of islands. Complexity: Time O(m*n) · Space O(m*n) Follow-up: Solve using Union-Find. What about islands in 3D?


Part 2 — Phone Screen Questions

7. Serialize and Deserialize Binary Tree Hard

Phone ScreenTrees / BFSLeetCode: #297

Design an algorithm to serialize and deserialize a binary tree. There is no restriction on how your serialization/deserialization algorithm should work.

Input: root = [1,2,3,null,null,4,5]
Output: [1,2,3,null,null,4,5]

Approach: BFS-based serialization — level-order traversal with null markers. For deserialization, use a queue to reconstruct the tree level by level, processing each node's children in order. Complexity: Time O(n) · Space O(n) Follow-up: Serialize using preorder DFS. Which is more compact?


8. Find Median from Data Stream Hard

Phone ScreenHeap / DesignLeetCode: #295

Design a data structure that supports adding integers and finding the median of all elements seen so far.

Input: addNum(1), addNum(2), findMedian() -> 1.5, addNum(3), findMedian() -> 2.0

Approach: Maintain a max-heap for the lower half and a min-heap for the upper half. Keep them balanced (size diff at most 1). Median is the top of the larger heap or average of both tops. Complexity: Time O(log n) add · O(1) findMedian · Space O(n) Follow-up: What if 99% of numbers are in [0, 100]? How can you optimize memory?


9. Course Schedule II Medium

Phone ScreenTopological Sort / DAGLeetCode: #210

There are n courses with prerequisites given as pairs. Return an ordering of courses to finish all courses, or an empty array if it is impossible.

Input: numCourses = 4, prerequisites = [[1,0],[2,0],[3,1],[3,2]]
Output: [0,1,2,3]

Approach: Topological sort using Kahn's algorithm (BFS with in-degree tracking). If all nodes are processed, return order; otherwise, cycle exists — return empty array. Complexity: Time O(V + E) · Space O(V + E) Follow-up: How would you detect which specific courses form the cycle?


10. Longest Palindromic Substring Medium

Phone ScreenExpand Around Center / DPLeetCode: #5

Given a string s, return the longest palindromic substring in s.

Input: s = "babad"
Output: "bab"

Approach: Expand Around Center — for each character (and between each pair), expand outward while left and right characters match. Track the longest palindrome found. Complexity: Time O(n^2) · Space O(1) Follow-up: Manacher's algorithm achieves O(n) — explain the idea.


11. LRU Cache Medium

Phone ScreenHashMap + Doubly Linked ListLeetCode: #146

Design a data structure that follows the Least Recently Used (LRU) cache eviction policy with get and put operations in O(1).

Input: LRUCache(2), put(1,1), put(2,2), get(1)->1, put(3,3), get(2)->-1

Approach: Combine a HashMap (O(1) lookup) with a doubly linked list (O(1) move-to-front and evict-from-tail). Maintain dummy head and tail sentinel nodes. Complexity: Time O(1) · Space O(capacity) Follow-up: Implement LFU (Least Frequently Used) cache.


12. Decode Ways Medium

Phone ScreenDynamic ProgrammingLeetCode: #91

A string of digits can be decoded using the mapping 'A'->1, 'B'->2, ..., 'Z'->26. Given a string s, return the number of ways to decode it.

Input: s = "226"
Output: 3  ("2,2,6" / "22,6" / "2,26")

Approach: DP where dp[i] = number of ways to decode s[0..i-1]. Check single digit decode and two-digit decode (10-26) at each step. Complexity: Time O(n) · Space O(n) (optimizable to O(1)) Follow-up: What if the encoded string has '*' wildcards?


13. Maximum Product Subarray Medium

Phone ScreenDynamic ProgrammingLeetCode: #152

Given an integer array nums, find a subarray that has the largest product, and return the product.

Input: nums = [2,3,-2,4]
Output: 6  (subarray [2,3])

Approach: Track both maxProd and minProd at each position (negative negative = positive). At each step: maxProd = max(nums[i], maxProd*nums[i], minProd*nums[i]). *Complexity: Time O(n) · Space O(1) Follow-up: What if the array contains zeros?


Part 3 — Onsite Interview Questions

14. Binary Tree Maximum Path Sum Hard

OnsiteTrees / DFSLeetCode: #124

A path in a binary tree is a sequence of nodes where each pair of adjacent nodes has an edge. Return the maximum path sum where the path can start and end at any node.

Input: root = [-10,9,20,null,null,15,7]
Output: 42  (path: 15 -> 20 -> 7)

Approach: Post-order DFS. For each node, compute max gain from left and right subtrees (clamped to 0). Update global max with node.val + leftGain + rightGain. Return node.val + max(leftGain, rightGain) to parent. Complexity: Time O(n) · Space O(h) Follow-up: What if negative values should never be included?


15. Sliding Window Maximum Hard

OnsiteMonotonic DequeLeetCode: #239

Given an array nums and an integer k, return the max value of each sliding window of size k.

Input: nums = [1,3,-1,-3,5,3,6,7], k = 3
Output: [3,3,5,5,6,7]

Approach: Monotonic deque (decreasing). For each element, remove indices outside the window from front. Remove elements smaller than current from back. Front always holds the max index. Complexity: Time O(n) · Space O(k) Follow-up: How would you extend this to a 2D sliding window max?


16. Design GPU Memory Allocator Hard

OnsiteSystems Design / MemoryLeetCode: #2502

Design a memory allocator for a fixed-size memory space. Implement allocate(size, mID) to find the first free block of size consecutive free units, and free(mID) to free all memory units with the given ID.

Input: allocate(1,1)->0, allocate(1,2)->1, allocate(1,3)->2, free(2)->1, allocate(3,4)->3

Approach: Maintain an array representing memory units. For allocate, scan for size consecutive free units (first-fit). For free, scan and unmark all units with matching mID. For Nvidia roles, discuss fragmentation and best-fit vs first-fit tradeoffs. Complexity: Time O(n) allocate/free · Space O(n) Follow-up: Nvidia follow-up: How does GPU memory management differ from CPU? Discuss VRAM allocation, memory pools, and buddy allocator systems.


17. Implement Trie (Prefix Tree) Medium

OnsiteTrie / DesignLeetCode: #208

Implement a Trie with insert, search, and startsWith methods. All inputs consist of lowercase English letters.

Input: insert("apple"), search("apple")->true, search("app")->false, startsWith("app")->true

Approach: Each TrieNode has an array of 26 children and a boolean isEnd. Insert by traversing/creating nodes. Search checks isEnd at last character. StartsWith only checks path existence. Complexity: Time O(m) per operation · Space O(26 * m * n) Follow-up: Implement delete operation. How would you compress the trie (Patricia/Radix tree)?


18. Pacific Atlantic Water Flow Medium

OnsiteBFS / DFS / Multi-sourceLeetCode: #417

Given an m x n matrix of heights, find all cells from which water can flow to both the Pacific (top/left) and Atlantic (bottom/right) oceans.

Input: heights = [[1,2,2,3,5],[3,2,3,4,4],[2,4,5,3,1],[6,7,1,4,5],[5,1,1,2,4]]
Output: [[0,4],[1,3],[1,4],[2,2],[3,0],[3,1],[4,0]]

Approach: Reverse BFS — start from ocean borders and flood inward (where height is >= current). Find intersection of Pacific-reachable and Atlantic-reachable cells. Complexity: Time O(m*n) · Space O(m*n) Follow-up: What if water flows to neighboring cells regardless of height difference?


19. Design a Thread-Safe Bounded Queue Hard

OnsiteConcurrency / SystemsLeetCode: #1188

Implement a thread-safe bounded blocking queue with enqueue (blocks if full) and dequeue (blocks if empty) operations.

Input: BoundedBlockingQueue(2), enqueue(1), enqueue(0), dequeue()->1, dequeue()->0

Approach: Use a deque with a mutex and two condition variables (not_full, not_empty). Enqueue waits on not_full, signals not_empty. Dequeue does the reverse. Critical section protects deque operations. Complexity: Time O(1) amortized · Space O(capacity) Follow-up: Nvidia context: How is this pattern used in GPU command queues and driver-kernel communication?


20. K Closest Points to Origin Medium

OnsiteHeap / QuickselectLeetCode: #973

Given an array of points on a 2D plane, return the k closest points to the origin (0, 0), sorted by distance.

Input: points = [[1,3],[-2,2]], k = 1
Output: [[-2,2]]

Approach: Max-heap of size k — for each point, if heap has less than k elements or point is closer than heap top, push it and pop the farthest. Alternatively, use Quickselect for O(n) average. Complexity: Time O(n log k) heap / O(n) Quickselect · Space O(k) Follow-up: What if points are streamed in real-time and you need top-k continuously?


21. Jump Game II Medium

OnsiteGreedyLeetCode: #45

Given array nums where nums[i] is the max jump length at position i, return the minimum number of jumps to reach the last index.

Input: nums = [2,3,1,1,4]
Output: 2  (jump to index 1, then to last)

Approach: Greedy BFS. Track the farthest reachable position in the current jump (curEnd) and the overall farthest reachable (farthest). Increment jumps when reaching curEnd. Complexity: Time O(n) · Space O(1) Follow-up: What is the minimum number of jumps if you can also jump backward?


22. Unique Paths in a Grid with Obstacles Medium

OnsiteDynamic ProgrammingLeetCode: #63

A robot on an m x n grid moves only right or down. Some cells have obstacles. Count the number of unique paths from top-left to bottom-right.

Input: obstacleGrid = [[0,0,0],[0,1,0],[0,0,0]]
Output: 2

Approach: DP table where dp[i][j] = dp[i-1][j] + dp[i][j-1] if no obstacle, else 0. Initialize first row and column carefully — block propagation after any obstacle. Complexity: Time O(m*n) · Space O(m*n) -> optimize to O(n) Follow-up: Nvidia context: Model GPU thread block scheduling as a path-counting problem.


23. Reverse Nodes in k-Group Hard

OnsiteLinked ListLeetCode: #25

Given the head of a linked list, reverse the nodes of the list k at a time, and return the modified list. Nodes that are left with fewer than k remaining should stay in original order.

Input: head = [1,2,3,4,5], k = 2
Output: [2,1,4,3,5]

Approach: Count k nodes ahead. If fewer than k remain, stop. Reverse the group in-place using three pointers, then recursively process the next group. Connect reversed group to previous tail. Complexity: Time O(n) · Space O(n/k) recursive stack Follow-up: Solve iteratively to achieve O(1) space.


24. Design a Parallel Task Scheduler Hard

OnsiteSystems Design / SchedulingLeetCode: #621

Given a list of tasks (each labeled A-Z) and a cooldown period n, find the least number of CPU intervals to finish all tasks. CPU can be idle between task executions.

Input: tasks = ["A","A","A","B","B","B"], n = 2
Output: 8

Approach: Count task frequencies. Most frequent task determines the frame size. Formula: max((maxFreq-1)*(n+1) + countOfMaxFreq, tasks.length). Nvidia follow-up: extend to GPU kernel scheduling with warp occupancy constraints. Complexity: Time O(n) · Space O(1) Follow-up: Design for GPU: how would warp divergence and SM occupancy affect scheduling decisions?


25. Alien Dictionary Hard

OnsiteTopological Sort / GraphLeetCode: #269

Given a list of words sorted lexicographically in an alien language, derive the order of characters in that language.

Input: words = ["wrt","wrf","er","ett","rftt"]
Output: "wertf"

Approach: Build a directed graph from character order implied by adjacent word comparisons. Run topological sort (BFS/Kahn's). Return empty string if cycle detected. Complexity: Time O(C) (C = total chars) · Space O(1) (at most 26 chars) Follow-up: What if two words violate lexicographic order (prefix before shorter word)?


26. Count of Smaller Numbers After Self Hard

OnsiteMerge Sort / BIT / Segment TreeLeetCode: #315

Given an integer array nums, return an integer array counts where counts[i] is the number of smaller elements to the right of nums[i].

Input: nums = [5,2,6,1]
Output: [2,1,1,0]

Approach: Modified merge sort — during the merge step, count inversions. When a right-side element is placed before left-side elements, those left-side elements all have a count increment. Complexity: Time O(n log n) · Space O(n) Follow-up: How would you solve this using a Binary Indexed Tree (Fenwick Tree)?


27. Longest Increasing Subsequence Medium

OnsiteDynamic Programming / Binary SearchLeetCode: #300

Given an integer array nums, return the length of the longest strictly increasing subsequence.

Input: nums = [10,9,2,5,3,7,101,18]
Output: 4  (subsequence [2,3,7,101])

Approach: Patience sorting / binary search. Maintain a tails array where tails[i] is the smallest tail element of all increasing subsequences of length i+1. Binary search to update. Complexity: Time O(n log n) · Space O(n) Follow-up: Reconstruct the actual subsequence, not just the length.


28. Basic Calculator II Medium

OnsiteStack / String ParsingLeetCode: #227

Implement a basic calculator to evaluate a string expression containing non-negative integers and operators +, -, *, /.

Input: s = "3+2*2"
Output: 7

Approach: Use a stack. Scan left to right. For +/- push number (with sign) to stack. For *// pop top, compute with current number, push result. Sum the stack at end. Complexity: Time O(n) · Space O(n) Follow-up: Extend to support parentheses (Basic Calculator III).


29. Reconstruct Itinerary Hard

OnsiteHierholzer's Algorithm / Euler PathLeetCode: #332

Given a list of airline tickets as [from, to] pairs, reconstruct the itinerary starting from "JFK" using all tickets exactly once in lexical order.

Input: tickets = [["MUC","LHR"],["JFK","MUC"],["SFO","SJC"],["LHR","SFO"]]
Output: ["JFK","MUC","LHR","SFO","SJC"]

Approach: Hierholzer's algorithm for Eulerian path. Sort adjacency lists lexicographically. DFS — only add airport to result when its adjacency list is exhausted (post-order). Reverse at end. Complexity: Time O(E log E) · Space O(V + E) Follow-up: What if multiple valid itineraries exist and you need the lexicographically smallest?


30. Maximum Frequency Stack Hard

OnsiteDesign / HashMapLeetCode: #895

Design a stack-like data structure that pushes integers and pops the most frequent element (ties broken by most recently pushed).

Input: push(5),push(7),push(5),push(7),push(4),push(5), pop()->5, pop()->7, pop()->5, pop()->4

Approach: Maintain a freq map and a group map (freq -> stack of elements at that freq). Track maxFreq. On push, increment freq, add to group. On pop, remove from maxFreq group; decrement if group empty. Complexity: Time O(1) · Space O(n) Follow-up: What if you need to pop the least frequent element instead?


31. Remove Invalid Parentheses Hard

OnsiteBFS / BacktrackingLeetCode: #301

Remove the minimum number of invalid parentheses to make the input string valid. Return all possible results in any order.

Input: s = "()())()"
Output: ["(())()", "()()()"]

Approach: BFS — generate all strings by removing one character at a time. Stop at the first level that produces valid strings. Use a HashSet to avoid duplicate strings at each BFS level. Complexity: Time O(n * 2^n) · Space O(2^n) Follow-up: Can you solve it in O(n) with a two-pass approach if you only need count of removals?


32. Design a Distributed Key-Value Store Hard

OnsiteSystem Design

Design a distributed key-value store that supports get/put operations with high availability, fault tolerance, and eventual consistency.

Key Design Points:

  • Consistent hashing for key distribution across nodes (virtual nodes for load balance)
  • Replication factor N (e.g., 3): each key replicated to N nodes on the ring
  • Read/write quorums: W + R > N for strong consistency, or W=1/R=1 for availability
  • Vector clocks for conflict resolution in eventual consistency model
  • Gossip protocol for node membership and failure detection
  • Merkle trees for anti-entropy and replica synchronization

Nvidia context: Discuss how distributed KV stores are used in ML training (parameter servers, gradient aggregation, checkpoint storage). Follow-up: How does NVLink change assumptions about distributed training vs. traditional network-based approaches?


33. Largest Rectangle in Histogram Hard

OnsiteMonotonic StackLeetCode: #84

Given an array of integers heights representing the histogram bar height, return the area of the largest rectangle in the histogram.

Input: heights = [2,1,5,6,2,3]
Output: 10

Approach: Monotonic increasing stack. Push indices. When a bar shorter than stack top is found, pop and calculate area using the popped bar as the shortest bar — width extends from current stack top to current index. Complexity: Time O(n) · Space O(n) Follow-up: Maximal Rectangle in a binary matrix (LeetCode #85) — apply histogram solution row by row.


34. Implement a Read-Write Lock Hard

OnsiteConcurrency / Systems

Design a Read-Write Lock where multiple readers can read simultaneously but writers need exclusive access. Discuss fairness policies.

Key Implementation Points:

  • Reader-preference: readers never wait for writers unless a writer holds the lock — writer starvation risk
  • Writer-preference: new readers block if a writer is waiting — reader starvation risk
  • Fair (FIFO): use a ticket-based or queue-based approach — both readers and writers queued in arrival order
  • Use a mutex + condition variables: readCount atomic counter, writeLock condition

Nvidia context: GPU driver synchronization — command buffers can be read by many contexts simultaneously but must be written exclusively. Discuss cudaEventRecord and stream synchronization analogies. Follow-up: How do you handle upgrade from read lock to write lock atomically?


35. Sudoku Solver Hard

OnsiteBacktrackingLeetCode: #37

Write a program to solve a Sudoku puzzle by filling the empty cells. Each row, column, and 3x3 box must contain digits 1-9 exactly once.

Input: board (9x9 with some filled cells)
Output: Solved board (modified in-place)

Approach: Backtracking with constraint propagation. For each empty cell, try digits 1-9 that don't violate constraints. Recurse. If no valid digit exists, backtrack. Optimize with bitmask sets for row/col/box constraints. Complexity: Time O(9^m) (m = empty cells) · Space O(1) extra Follow-up: How would you parallelize Sudoku solving across multiple GPU threads?


36. Integer to English Words Hard

OnsiteString / RecursionLeetCode: #273

Convert a non-negative integer num to its English words representation.

Input: num = 1234567
Output: "One Million Two Hundred Thirty Four Thousand Five Hundred Sixty Seven"

Approach: Process in chunks of 3 digits using a helper helper(n) for numbers 0-999. Apply group names (Thousand, Million, Billion) at each chunk level. Handle edge cases: zero, teens (11-19), multiples of 10. Complexity: Time O(1) (bounded by max int) · Space O(1) Follow-up: Generate words in a different language (internationalization). How would you structure this?


37. Smallest Range Covering Elements from K Lists Hard

OnsiteHeap / Sliding WindowLeetCode: #632

You have k lists of sorted integers. Find the smallest range [a, b] such that there is at least one number from each list within the range.

Input: nums = [[4,10,15,24,26],[0,9,12,20],[5,18,22,30]]
Output: [20,24]

Approach: Min-heap with one element from each list. Track the current maximum. Pop min, update range if [min, curMax] is smaller, push next element from same list. Stop when any list is exhausted. Complexity: Time O(n log k) · Space O(k) Follow-up: How does this problem model finding a common window in distributed log streams?


38. Bitwise AND of Numbers Range Medium

OnsiteBit ManipulationLeetCode: #201

Given two integers left and right, return the bitwise AND of all numbers in the range [left, right].

Input: left = 5, right = 7
Output: 4

Approach: Find the common prefix of left and right in binary — shift both right until equal, tracking shift count. Result is the common prefix shifted back left. Bits that differ across the range become 0. Complexity: Time O(log n) · Space O(1) Follow-up: Nvidia context: How is bit masking used in GPU warp predicate execution and SIMT divergence?


39. Find All Anagrams in a String Medium

OnsiteSliding Window / HashMapLeetCode: #438

Given strings s and p, return an array of all start indices of p's anagrams in s.

Input: s = "cbaebabacd", p = "abc"
Output: [0,6]

Approach: Fixed-size sliding window of size p.length. Maintain a frequency diff counter — when all frequencies match (count == 0), record the window start. Slide right, updating frequencies. Complexity: Time O(n) · Space O(26) = O(1) Follow-up: What if characters in p can be Unicode (not just lowercase ASCII)?


40. Design a Rate Limiter Medium

OnsiteSystem Design / Sliding WindowLeetCode: #359

Design a rate limiter that allows at most N requests per second per user. Discuss token bucket, leaky bucket, fixed window, and sliding window approaches.

Approach Comparison:

  • Fixed window: simple, but burst at window boundary (2x allowed in 1 second)
  • Sliding window log: precise, but high memory (store timestamps per user)
  • Sliding window counter: approximate, O(1) memory using two counters
  • Token bucket: allows controlled bursts; tokens regenerate at fixed rate
  • Leaky bucket: smooths traffic; useful for Nvidia's GPU job submission queuing

Nvidia context: Rate limiting GPU API calls per process, throttling CUDA kernel launches, managing NVLink bandwidth allocation. Follow-up: Implement distributed rate limiting across multiple servers using Redis.


41. Median of Two Sorted Arrays Hard

OnsiteBinary SearchLeetCode: #4

Given two sorted arrays nums1 and nums2, return the median of the two sorted arrays. The overall run time complexity should be O(log(m+n)).

Input: nums1 = [1,3], nums2 = [2]
Output: 2.0

Approach: Binary search on the smaller array. Find partition points in both arrays such that elements left of partitions are less than elements right. Use maxLeft and minRight across both partitions to determine median. Complexity: Time O(log(min(m,n))) · Space O(1) Follow-up: Generalize to find the Kth smallest element in two sorted arrays.


42. Interleaving String Hard

OnsiteDynamic ProgrammingLeetCode: #97

Given strings s1, s2, and s3, determine if s3 is formed by an interleaving of s1 and s2.

Input: s1 = "aabcc", s2 = "dbbca", s3 = "aadbbcbcac"
Output: true

Approach: 2D DP where dp[i][j] = true if s3[0..i+j-1] can be formed by interleaving s1[0..i-1] and s2[0..j-1]. Transition: match s1[i-1] or s2[j-1] with current s3 char. Complexity: Time O(m*n) · Space O(m*n) -> optimize to O(n) Follow-up: Interleaving of K strings — how does complexity scale?


43. Minimum Cost to Hire K Workers Hard

OnsiteHeap / GreedyLeetCode: #857

There are n workers with given quality and wage values. Hire exactly k workers such that every chosen worker is paid at least their expected wage, and all are paid in proportion to their quality. Return the minimum total wage.

Input: quality = [10,20,5], wage = [70,50,30], k = 2
Output: 105.0

Approach: Sort workers by wage/quality ratio. For each worker as the highest ratio (captain), find cheapest k workers from those with lower ratios using a max-heap of quality. Total cost = captain's ratio sum of k qualities. *Complexity: Time O(n log n) · Space O(k) Follow-up: What if you need to minimize maximum wage disparity rather than total cost?


44. Split Array Largest Sum Hard

OnsiteBinary Search / DPLeetCode: #410

Given an integer array nums and an integer k, split nums into k non-empty subarrays to minimize the largest subarray sum.

Input: nums = [7,2,5,10,8], k = 2
Output: 18  (split: [7,2,5] and [10,8])

Approach: Binary search on the answer (range: max(nums) to sum(nums)). For each candidate answer, greedily check if array can be split into at most k parts with each part sum <= candidate. Complexity: Time O(n log(sum)) · Space O(1) Follow-up: Nvidia context: load balancing GPU workloads across streaming multiprocessors.


45. Design a GPU Kernel Launch System Hard

OnsiteSystem Design / GPU Architecture

Design a system for scheduling and executing GPU kernels across multiple CUDA streams on a multi-GPU server. Discuss dependencies, concurrency, and resource management.

Key Design Points:

  • CUDA Streams: kernels within same stream execute sequentially; across streams can overlap
  • CUDA Graphs: capture kernel launch sequences, replay with minimal CPU overhead — key for inference
  • Dependency tracking: use CUDA events (cudaEventRecord, cudaStreamWaitEvent) to enforce cross-stream ordering
  • Multi-GPU: NVLink for direct GPU-GPU transfers, peer-to-peer memory access
  • MPS (Multi-Process Service): share GPU across multiple processes with context isolation
  • Occupancy: balance thread blocks per SM — too few wastes parallelism, too many causes register spilling

Expected discussion: Trade-off between fine-grained stream parallelism and synchronization overhead. How to maximize SM utilization while respecting memory bandwidth limits. Follow-up: How would you extend this for a training cluster with thousands of GPUs?


46. All O`one Data Structure Hard

OnsiteDesign / Doubly Linked ListLeetCode: &#35;432

Design a data structure to store strings with O(1) inc/dec operations and O(1) getMaxKey/getMinKey.

Input: inc("a"), inc("b"), inc("b"), inc("c"), inc("c"), inc("c"), getMaxKey()->"c", getMinKey()->"a"

Approach: Doubly linked list of buckets (each bucket = a frequency level with a set of keys). HashMap from key to its bucket node. Inc: move key to bucket+1 (create if not exists). Dec: move to bucket-1. Head = min freq, tail = max freq. Complexity: Time O(1) all operations · Space O(n) Follow-up: How would you adapt this for tracking GPU kernel execution frequencies?


47. Maximum Gap Hard

OnsiteBucket Sort / Radix SortLeetCode: &#35;164

Given an unsorted array, return the maximum difference between successive elements in its sorted form. Must run in O(n) time and space.

Input: nums = [3,6,9,1]
Output: 3

Approach: Bucket sort / pigeonhole principle. With n elements, create n-1 buckets between min and max. Each bucket stores only min and max. Maximum gap must span at least one empty bucket — compare adjacent bucket min/max values. Complexity: Time O(n) · Space O(n) Follow-up: How is radix sort used in GPU sorting (e.g., CUB library's DeviceRadixSort)?


48. Parallel Prefix Sum (Scan) Hard

OnsiteParallel Algorithms / GPU

Implement the parallel prefix sum (exclusive scan) algorithm. Given an array A of n elements, compute output[i] = A[0] + A[1] + ... + A[i-1] (exclusive scan) using parallel operations.

Input:  [3, 1, 7, 0, 4, 1, 6, 3]
Output: [0, 3, 4, 11, 11, 15, 16, 22]

Approach (Blelloch Scan):

  1. Up-sweep (reduce): pairwise summation up a binary tree — log n parallel steps
  2. Set root to 0
  3. Down-sweep: distribute prefix sums down the tree — log n more steps

Total: O(n) work, O(log n) depth — optimal for GPU.

Nvidia context: Prefix scan is a fundamental GPU primitive used in stream compaction, sorting (radix sort carries), histogram building, and sparse matrix operations. Understanding it is expected for GPU software roles. Follow-up: How does warp-level scan using __shfl_down_sync differ from shared memory scan?


49. Minimum Number of Refueling Stops Hard

OnsiteGreedy / Max-HeapLeetCode: &#35;871

A car starts with startFuel liters and travels to a target distance. Fuel stations are at given positions with given fuel amounts. Return the minimum number of stops, or -1 if impossible.

Input: target = 100, startFuel = 10, stations = [[10,60],[20,30],[30,30],[60,40]]
Output: 2

Approach: Greedy + max-heap. Travel as far as possible. When stuck, retroactively refuel at the station with the most fuel (greedily pick the best past option using a max-heap). This simulates optimal choices without trying all combinations. Complexity: Time O(n log n) · Space O(n) Follow-up: Generalize to a vehicle with limited tank capacity.


50. Design a Compiler Symbol Table Hard

OnsiteDesign / Compiler Systems

Design a symbol table for a compiler that supports: declaring variables in nested scopes, looking up the nearest enclosing declaration of a variable, and exiting/entering scopes efficiently.

Key Design Points:

  • Scoped hash table: stack of HashMaps — each scope gets its own map
  • Enter scope: push a new HashMap onto the stack
  • Exit scope: pop the top HashMap
  • Declare(name, type): insert into top map; detect re-declaration in same scope
  • Lookup(name): search from top to bottom of stack — first match is nearest scope
  • Optimization: chained hash tables (each scope points to parent) for O(1) push/pop instead of O(depth) lookup

Nvidia context: NVCC (Nvidia's CUDA compiler) maintains symbol tables for host and device code separately. Variable qualifiers (__device__, __shared__, __constant__) add extra scope/namespace dimensions. Follow-up: How would you handle template instantiation in C++ symbol tables? What about CUDA cooperative groups and their scoping rules?


Preparation Tips

Nvidia's interviews are technically deep, especially for GPU and systems roles. Here is how to prepare effectively:

For Coding Rounds:

  • Master arrays, strings, trees, graphs, DP, and heap problems at Medium-Hard level
  • Practice writing clean, modular C++ code — Nvidia strongly prefers C++ candidates
  • Know STL containers: priority_queue, unordered_map, deque, set
  • Focus on bit manipulation — very common for GPU/embedded roles
  • Study monotonic stacks and deques — appear frequently in Nvidia onsite rounds

For Systems/Architecture Rounds:

  • Understand CPU/GPU memory hierarchy: registers, shared memory, L1/L2 cache, DRAM
  • Learn basic CUDA concepts: thread/block/grid hierarchy, warp execution, memory coalescing
  • Study parallel algorithms: prefix scan, parallel reduction, parallel sort
  • Know OS fundamentals: memory management, synchronization primitives, process scheduling
  • Practice system design: distributed key-value stores, rate limiters, task schedulers

For All Rounds:

  • Think out loud — Nvidia interviewers value your reasoning process as much as the solution
  • Always discuss time and space complexity and potential optimizations
  • Ask clarifying questions about constraints and edge cases upfront
  • Be prepared to code in C++ — not just Python