Google search results for the top 1000 highest ranking URLs

You have trillions of URLs stored uniquely, without order, across numerous machines, each with a score. Find the 1000 URLs with the highest scores.

Problem Statement

Design Google!

You have trillions of URLs stored uniquely, without order, across numerous machines -- as in each machine has only a partition of the data. Each URL has an associated score. Describe an algorithm to find the 1000 URLs with the highest scores.

Note: To actually fully emulate Google Search Engine, you'd need to first filter the results by relevance to a query, but we're not asking that here. If you can solve the simpler problem as it is stated, you can also solve the more complex problem.

# Input example on one of the machines
www.google.com, 100
www.foobar.com, 2
www.example.com, 10

Evaluation

  1. The ability to distinguish between sorting and finding a max in a list (25%)
  2. The ability to get the Top 1000 elements in any list in linear time (25%)
  3. The ability to describe mapping the algorithm across numerous machines and reducing the results in a merge step (25%)
  4. Describing the time-complexity of the solution in O-notation (25%)

Solution

Single machine solution (map step)

First, at the single machine level, observe that sorting would be way too expensive. The input lives on disk. There's not enough memory to juggle the data for sorting. Also observe that the maximum value in any array can be found linearly, without any sorting.

The problem is we're looking for the K max values. This part of the problem tests the familiarity of the candidate with data structures. A (max- or) min-heap data structure is a binary tree that guarantees the (largest or) smallest node to be at the root. It sorts itself at insert time. For the heap of size n, it takes log(n) time to insert a new value, and constant time to pop the (maximum or) minimum.

Heap sort basically works by putting all of the given unordered values into a min-heap and then popping them one by one from the root. Because the root is guaranteed to be the minimum value at any given time, the order in which the values come out will be sorted. This takes n*log(n) time, for n values each taking log(n) time to insert.

But we already mentioned there's no enough memory per machine to keep the entire data in a heap. Fret not, for we don't need the entire data. We need only the 1000 maximum values. Let's say every time we insert a value into our min-heap, we check that its size hasn't exceeded 1000 nodes. If it has, we pop it. The entire memory we need for the heap will be for 1000 nodes this way, and we're guaranteed to retain only the maximum values inside the heap by the time we run out of data to stream through the heap. Voila! 1000 highest scored nodes on that machine can be popped off the heap once the data runs out.

Multi-machine solution (reduce step)

Observe that if you have a giant farm of machines, you can have each of them in parallel perform this 1000-best results locally. The worst case performance is only as bad as the weakest machine. Now you have many machines, each having calculated its own 1000 most popular URLs. Once that's done, you can collect all these result sets from the "mapping" step into a centralized machine called "reducer" (this is known as shuffling the mapped data)

Your reducer only has to find the most popular 1000 URLs in a set of 1000 x M elements, where M is the number of mapper machines above. Well that's a familiar problem we already solved: Use another heap at this stage (i.e. re-run the code from the previous task) to find the best 1000 of the best 1000s.

We just did Map Reduce.

In real (Google) life, the number of mapper machines is so many that having one reducer becomes a bottleneck in its own right. So you'll have multiple reducers, and you might end up having multiple generations of the reduce step. It's okay. The reduce code will be re-usable for all of these hierarchical reductions implicitly. It's kind of like managing people. Once you know how to manage engineers, you can also manage managers, and if need be, managers' managers using the same algorithm.

Complexity Calculations

The complexity of the correct answer is n log (k) where n is the number of URLs and k is the constant size of the heap. In this case, k is 1000, so the complexity is n log (1000). Note that the binary log of 1000 is a constant (just below ~ 10, since 2 ^ 10 is 1024). So the solution of O(10n), equivalent to O(n), has linear complexity.

Throw caching per query on top of that, and you'll have one fast Search Engine called Google!

Interviewer's Perspective

  • The weak candidate starts talking about sorting the whole data as if it's the solution. This immediately reveals a lack of experience with scalability. That's okay. The interviewer will pull the candidate back and point out that there's not enough memory to sort anything of any kind. If the weak candidate insists on sorting, the interview is pretty much over soon.
  • The intermediate candidate understands that finding the maximum value in an unsorted list takes linear time. But he may struggle doing this K times. He may also struggle with parallel processing by trying to find each max across the whole cluster of machines, repeatedly, leading to 1000 reduction steps. His answer's performance is "Network-bound", i.e. bottlenecked by the networking capacity between the machines, and super slow.
  • The strong candidate quickly offers a max-heap solution to the top-K problem. He may struggle a little with the Map Reduce part, but he'll figure out how to re-use the min-heap trick at the reduce stage. Re-using is important to him. He'll have one map and one reduce execution time.

References

Val during interview at Google

Recommended reading

Author
Amin Ariana
Published
January 2014