MapReduce parallel processing of all known URLs for median length

You are given the data set of all Google crawled URLs on the Internet -- that's a very large set. Write an algorithm to find the median URL length.

Problem Statement

You are given the data set of all crawled URLs on the Internet. Assume there are N of them. Obviously N is so incredibly and astonishingly large that the set cannot be stored on one machine; so we've employed S very cheap machines and sharded the data set across them. Each data point (i.e. URL string) is only stored on one, perhaps randomly chosen, machine. Since the capacity of each of these cheap machines is relatively low, assume S itself is an incredibly large number. You're given a development machine of your own.

Design an algorithm to find the median of all known Internet URL lengths.

Note: 15 minutes into the problem I'm going to have to remind you to read all little assumptions carefully. So do this ahead of time. For example, I shouldn't have to tell you "Median and mean are different things."


Paul from Google

Recommended reading

Amin Ariana
February 2011