MapReduce parallel processing of all known URLs for median length
You are given the data set of all Google-crawled URLs on the Internet; that's a very large set. Write an algorithm to find the median URL length.
Problem Statement
You are given the data set of all crawled URLs on the Internet. Assume there are N of them. Obviously N is so incredibly and astonishingly large that the set cannot be stored on one machine; so we've employed S very cheap machines and sharded the data set across them. Each data point (i.e. URL string) is only stored on one, perhaps randomly chosen, machine. Since the capacity of each of these cheap machines is relatively low, assume S itself is an incredibly large number. You're given a development machine of your own.
Design an algorithm to find the median of all known Internet URL lengths.
Note: 15 minutes into the problem I'm going to have to remind you to read all little assumptions carefully. So do this ahead of time. For example, I shouldn't have to tell you "Median and mean are different things."
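One standard line of attack (a sketch, not necessarily the answer the interviewer had in mind) exploits the fact that URL lengths are bounded in practice: each machine reduces its shard to a small histogram of lengths, the histograms are summed on the development machine, and the median falls out of a cumulative-count walk. The `MAX_URL_LEN` cap and the toy shard layout below are assumptions for illustration:

```python
from collections import Counter

MAX_URL_LEN = 2048  # assumed practical cap on URL length; keeps histograms tiny

def map_phase(shard):
    """Mapper, run on each cheap machine: collapse its shard of URLs into
    a histogram of lengths -- O(MAX_URL_LEN) integers, not O(N) strings."""
    return Counter(min(len(url), MAX_URL_LEN) for url in shard)

def reduce_phase(histograms):
    """Reducer, run on the development machine: sum the S tiny histograms
    into one global histogram of all N URL lengths."""
    total = Counter()
    for h in histograms:
        total.update(h)
    return total

def median_from_histogram(hist):
    """Walk cumulative counts to the middle element(s) of the sorted lengths."""
    n = sum(hist.values())
    lower_rank = (n - 1) // 2  # 0-based ranks of the middle element(s)
    upper_rank = n // 2
    running = 0
    lower = upper = None
    for length in sorted(hist):
        running += hist[length]
        if lower is None and running > lower_rank:
            lower = length
        if running > upper_rank:
            upper = length
            break
    return (lower + upper) / 2

# Toy run with three "machines":
shards = [["http://a.io", "http://bb.io/x"],
          ["http://ccc.example.com/long/path"],
          ["http://d.co", "http://ee.co/y", "http://fff.co/zz"]]
histograms = [map_phase(s) for s in shards]
median = median_from_histogram(reduce_phase(histograms))
```

The key observation is that only the length distribution matters, never the URLs themselves, so each of the S machines ships at most a few kilobytes upstream regardless of how many URLs it holds.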
References
Paul from Google
Recommended reading

Google search results for the top 1000 highest ranking URLs
You have trillions of URLs stored uniquely, without order, across numerous machines, each URL carrying a score. Find the 1000 URLs with the highest scores.

Map Reduce Data Mining and Ranking
Given a Google Search Query, find the AdWords Categories best describing it

Extreme Recursion with Binary Tree Topologies
How many binary tree topologies are possible given n nodes?

Sort a million 32-bit integers
What is the most efficient way to sort a million 32-bit integers?
Author: Amin Ariana
Published: February 2011