MapReduce parallel processing of all known URLs for median length
You are given the data set of all Google crawled URLs on the Internet -- that's a very large set. Write an algorithm to find the median URL length.
You are given the data set of all crawled URLs on the Internet. Assume there are N of them. Obviously N is so incredibly and astonishingly large that the set cannot be stored on one machine; so we've employed S very cheap machines and sharded the data set across them. Each data point (i.e. URL string) is only stored on one, perhaps randomly chosen, machine. Since the capacity of each of these cheap machines is relatively low, assume S itself is an incredibly large number. You're given a development machine of your own.
Design an algorithm to find the median of all known Internet URL lengths.
Note: fifteen minutes into the problem, I'm going to have to remind you to read all the little assumptions carefully, so do it ahead of time. For example, I shouldn't have to tell you that median and mean are different things.
Paul from Google
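One standard line of attack, sketched below under an assumption the question doesn't state: URL lengths are bounded by some small cap (say 2048 characters), so while the URLs themselves can't fit anywhere, a histogram of lengths easily fits on the development machine. Each shard maps its URLs to a local length histogram, the histograms are reduced by summation, and the median is read off the tiny combined histogram. The function names and the cap are illustrative, not part of the problem statement; this version returns the lower median when N is even.

```python
from collections import Counter

MAX_URL_LEN = 2048  # assumed practical cap on URL length, not given in the problem


def local_histogram(urls):
    """Map step, run independently on each shard: count URLs by length."""
    return Counter(len(u) for u in urls)


def merge_histograms(histograms):
    """Reduce step: sum the per-shard counts into one small histogram."""
    total = Counter()
    for h in histograms:
        total.update(h)
    return total


def median_from_histogram(hist):
    """Walk the combined histogram (at most MAX_URL_LEN entries) to the median."""
    n = sum(hist.values())
    target = (n - 1) // 2  # index of the lower median in sorted order
    seen = 0
    for length in sorted(hist):
        seen += hist[length]
        if seen > target:
            return length
```

Each shard ships at most `MAX_URL_LEN` counters to the reducer, so the network cost is O(S), independent of N.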
Google search results for the top 1000 highest ranking URLs
You have trillions of URLs stored uniquely, without order, across numerous machines, each with a score. Find the 1000 URLs with the highest scores.
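A common approach, sketched here with hypothetical function names: each machine keeps only its own 1000 highest-scoring URLs using a bounded heap, and a final pass merges the S small candidate lists. Only S * 1000 items ever cross the network, never the trillions of originals.

```python
import heapq


def local_top_k(scored_urls, k=1000):
    """Run on each machine: keep only its k highest-scoring (url, score) pairs."""
    return heapq.nlargest(k, scored_urls, key=lambda pair: pair[1])


def global_top_k(per_machine_tops, k=1000):
    """Merge the per-machine candidate lists and take the overall top k."""
    merged = (item for tops in per_machine_tops for item in tops)
    return heapq.nlargest(k, merged, key=lambda pair: pair[1])
```

If S itself is too large for one merging machine, the same merge step can be applied hierarchically, since `global_top_k` has the same shape as `local_top_k`.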
MapReduce Data Mining and Ranking
Given a Google Search Query, find the AdWords Categories best describing it
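The question leaves the matching method open; a toy baseline, with entirely hypothetical category data, is to score each category by how many of its keywords appear in the query and return the matching categories in score order. A real system would use learned relevance models rather than raw term overlap.

```python
def score_categories(query, category_keywords):
    """Toy scoring: rank categories by query-term overlap with their keyword sets.

    category_keywords maps a category name to a set of lowercase keywords;
    the mapping here is an illustrative stand-in, not real AdWords data.
    """
    terms = set(query.lower().split())
    scores = {cat: len(terms & kws) for cat, kws in category_keywords.items()}
    return sorted((c for c, s in scores.items() if s > 0),
                  key=lambda c: -scores[c])
```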
Extreme Recursion with Binary Tree Topologies
How many binary tree topologies are possible given n nodes?
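The answer is the nth Catalan number: pick a root, split the remaining n-1 nodes between the left and right subtrees, and sum over all splits, giving C(0) = 1 and C(n) = sum of C(k) * C(n-1-k) for k from 0 to n-1. A direct dynamic-programming sketch of that recurrence:

```python
def tree_topologies(n):
    """Count distinct binary tree shapes on n nodes (the Catalan numbers).

    c[m] = number of shapes on m nodes; each shape puts `left` nodes in the
    left subtree and m-1-left in the right, for every possible split.
    """
    c = [0] * (n + 1)
    c[0] = 1  # the empty tree is the single shape on zero nodes
    for m in range(1, n + 1):
        c[m] = sum(c[left] * c[m - 1 - left] for left in range(m))
    return c[n]
```

For example, three nodes admit five shapes; a naive recursion without the table repeats subproblems exponentially, while this version is O(n^2).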
Sort a million 32-bit integers
What is the most efficient way to sort a million 32-bit integers?
- Amin Ariana
- February 2011