Google search results for the top 1000 highest-ranking URLs
Problem Statement
Design Google!
You have trillions of URLs stored uniquely, without order, across numerous machines; that is, each machine holds only a partition of the data. Each URL has an associated score. Describe an algorithm to find the 1000 URLs with the highest scores.
Note: To fully emulate the Google search engine, you'd first need to filter the results by relevance to a query, but we're not asking that here. If you can solve the simpler problem as stated, you can also solve the more complex one.
# Input example on one of the machines
www.google.com, 100
www.foobar.com, 2
www.example.com, 10
Evaluation
- The ability to distinguish between sorting and finding a max in a list (25%)
- The ability to get the top 1000 elements of any list in linear time (25%)
- The ability to describe mapping the algorithm across numerous machines and reducing the results in a merge step (25%)
- Describing the time complexity of the solution in O-notation (25%)
Solution
Single machine solution (map step)
First, at the single-machine level, observe that sorting would be way too expensive. The input lives on disk, and there's not enough memory to hold the data for sorting. Also observe that the maximum value in any array can be found in linear time, without any sorting.
The problem is that we're looking for the K maximum values. This part of the problem tests the candidate's familiarity with data structures. A min-heap (or max-heap) is a binary tree that guarantees the smallest (or largest) node is at the root, restoring this property on every insert. For a heap of size n, inserting a new value takes O(log n) time, peeking at the minimum (or maximum) takes constant time, and popping it takes O(log n).
Heap sort basically works by pushing all of the given unordered values onto a min-heap and then popping them one by one from the root. Because the root is guaranteed to hold the minimum value at any given time, the values come out in sorted order. This takes O(n log n) time: n values, each taking O(log n) to insert and O(log n) to pop.
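As a minimal sketch, heap sort can be expressed with Python's heapq module (which implements a min-heap over a list):

```python
import heapq

def heap_sort(values):
    """Push everything onto a min-heap, then pop the root (the current
    minimum) until the heap is empty: O(n log n) overall."""
    heap = []
    for v in values:                  # n inserts, each O(log n)
        heapq.heappush(heap, v)
    return [heapq.heappop(heap) for _ in range(len(heap))]  # n pops

print(heap_sort([100, 2, 10]))  # → [2, 10, 100]
```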
But we already mentioned there's not enough memory per machine to keep the entire data set in a heap. Fret not, for we don't need the entire data set; we need only the 1000 maximum values. Every time we insert a value into our min-heap, we check whether its size has exceeded 1000 nodes. If it has, we pop the minimum. This way the heap never needs memory for more than 1000 nodes, and we're guaranteed to retain only the maximum values inside it by the time we run out of data to stream through. Voila! Once the data runs out, the 1000 highest-scored URLs on that machine can be popped off the heap.
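A sketch of this bounded-heap map step, again using Python's heapq (the function and variable names here are illustrative, not from the original):

```python
import heapq

def top_k(stream, k=1000):
    """Keep a min-heap of at most k (score, url) pairs while streaming.
    Whenever the heap grows past k entries, pop the smallest, so only
    the k highest scores survive. Memory: O(k); time: O(n log k)."""
    heap = []
    for url, score in stream:
        heapq.heappush(heap, (score, url))
        if len(heap) > k:
            heapq.heappop(heap)       # discard the current minimum
    return sorted(heap, reverse=True)  # highest scores first

# The input example from the problem statement, with k=2 for brevity:
urls = [("www.google.com", 100), ("www.foobar.com", 2), ("www.example.com", 10)]
print(top_k(urls, k=2))  # → [(100, 'www.google.com'), (10, 'www.example.com')]
```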
Multi-machine solution (reduce step)
Observe that if you have a giant farm of machines, each of them can compute its local top-1000 results in parallel. The worst-case performance is only as bad as the slowest machine. Now you have many machines, each having calculated its own 1000 most popular URLs. Once that's done, you collect all these result sets from the "mapping" step onto a centralized machine called the "reducer" (this is known as shuffling the mapped data).
Your reducer only has to find the most popular 1000 URLs in a set of 1000 x M elements, where M is the number of mapper machines above. Well, that's a familiar problem we already solved: use another heap at this stage (i.e. rerun the code from the previous step) to find the best 1000 of the best 1000s.
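A sketch of the reduce step, under the assumption that each mapper ships its local results as a list of (score, url) pairs (heapq.nlargest applies the same bounded-heap selection):

```python
import heapq

def reduce_top_k(mapper_results, k=1000):
    """Merge the k*M candidate pairs shipped by M mappers, then rerun
    the same top-k selection over them."""
    candidates = [pair for result in mapper_results for pair in result]
    return heapq.nlargest(k, candidates)  # same O(n log k) idea as the map step

# Hypothetical local top-2 results from two mapper machines:
machine_a = [(100, "www.google.com"), (10, "www.example.com")]
machine_b = [(90, "www.wikipedia.org"), (2, "www.foobar.com")]
print(reduce_top_k([machine_a, machine_b], k=2))
# → [(100, 'www.google.com'), (90, 'www.wikipedia.org')]
```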
We just did Map Reduce.
In real (Google) life, the number of mapper machines is so large that a single reducer becomes a bottleneck in its own right. So you'll have multiple reducers, and you might end up having multiple generations of the reduce step. It's okay. The reduce code is implicitly reusable across all of these hierarchical reductions. It's kind of like managing people: once you know how to manage engineers, you can also manage managers, and if need be, managers' managers, using the same algorithm.
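These hierarchical reductions can be sketched as a simple tree reduction; the fan_in parameter below (how many result sets each reducer merges per generation) is an assumption for illustration:

```python
import heapq

def hierarchical_reduce(result_sets, k=1000, fan_in=2):
    """Repeatedly merge result sets fan_in at a time until one remains,
    mimicking multiple generations of reducers. The same top-k code is
    reused, unchanged, at every level of the hierarchy."""
    while len(result_sets) > 1:
        merged = []
        for i in range(0, len(result_sets), fan_in):
            group = result_sets[i:i + fan_in]
            candidates = [pair for result in group for pair in result]
            merged.append(heapq.nlargest(k, candidates))
        result_sets = merged
    return result_sets[0]

# Four mappers' local results, reduced in two generations:
sets = [[(100, "a")], [(90, "b")], [(80, "c")], [(70, "d")]]
print(hierarchical_reduce(sets, k=2))  # → [(100, 'a'), (90, 'b')]
```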
Complexity Calculations
The complexity of the correct answer is O(n log k), where n is the number of URLs and k is the constant size of the heap. In this case k is 1000, so the complexity is O(n log 1000). Note that the binary log of 1000 is a constant (just under 10, since 2^10 = 1024). So the solution is O(10n), equivalent to O(n): linear complexity.
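As a quick sanity check of that arithmetic:

```python
import math

# log2(1000) is a small constant, so O(n log 1000) collapses to O(n).
print(math.log2(1000))   # ≈ 9.9658, just under 10
assert 2 ** 10 == 1024   # which is why log2(1000) < 10
```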
Throw caching per query on top of that, and you'll have one fast Search Engine called Google!
Interviewer's Perspective
- The weak candidate starts talking about sorting the whole data set as if that were the solution. This immediately reveals a lack of experience with scalability. That's okay: the interviewer will pull the candidate back and point out that there's not enough memory to sort anything of any kind. If the weak candidate insists on sorting, the interview is pretty much over.
- The intermediate candidate understands that finding the maximum value in an unsorted list takes linear time, but may struggle doing this K times. He may also struggle with parallel processing by trying to find each maximum across the whole cluster of machines, repeatedly, leading to 1000 reduction steps. That answer's performance is network-bound, i.e. bottlenecked by the networking capacity between the machines, and super slow.
- The strong candidate quickly offers a bounded min-heap solution to the top-K problem. He may struggle a little with the Map Reduce part, but he'll figure out how to reuse the min-heap trick at the reduce stage. Reuse is important to him. He'll have one map and one reduce execution time.
References
Val during interview at Google
Recommended reading

MapReduce parallel processing of all known URLs for median length
You are given the data set of all Google-crawled URLs on the Internet, a very large set. Write an algorithm to find the median URL length.

Map Reduce Data Mining and Ranking
Given a Google search query, find the AdWords categories that best describe it.

Sort a million 32-bit integers
What is the most efficient way to sort a million 32-bit integers?

Rule-based HTML compilation
Given an HTML with messed up tag nesting, output the correct and unambiguous version of the HTML.
Author: Amin Ariana
Published: January 2014