Find a missing number from a huge list of consecutive numbers (Interview Question)

You have a huge (> 10 GB) text file in HDFS containing consecutive natural numbers, from which one number is missing.
You have to find that missing number as efficiently as possible.

You are free to use any framework of choice.

How about this one?

numsstr = sc.textFile("/data/1lac")              # one number per line, 1..100000
nums0 = numsstr.map(lambda x: int(x))            # parse each line into an int
nums1 = nums0.flatMap(lambda x: [x, x + 1])      # emit each number and its successor
extremes = sc.parallelize([1, 100001])           # compensate for the two boundaries
allnums = nums1.union(extremes)                  # renamed to avoid shadowing the built-in all()
pairs = allnums.map(lambda x: (x, 1))
counts = pairs.reduceByKey(lambda x, y: x + y)   # frequency of every number
exceptions = counts.filter(lambda x: x[1] != 2)  # only the missing number and its successor
exceptions.min()[0]                              # the smaller of the two is the answer

In this approach, we first generate each number's successor alongside the number itself, then count the frequency of every value. If no number were missing, every value would appear exactly twice: once as itself and once as the successor of the previous number. The two extremes are added separately because the first and last values would otherwise appear only once. With one number missing, both that number and its successor appear only once, so filtering for counts other than 2 and taking the minimum gives the missing number.
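The same pair-generation trick can be checked locally without Spark. This is a plain-Python sketch of the idea, assuming the range 1..100000 and an illustrative missing value (all names here are mine, not from the original code):

```python
# Local sketch of the successor-generation frequency method (no Spark).
from collections import Counter

def find_missing(nums, lo, hi):
    counts = Counter()
    for x in nums:            # each present number contributes itself and its successor
        counts[x] += 1
        counts[x + 1] += 1
    counts[lo] += 1           # the extremes compensate for the two boundaries
    counts[hi + 1] += 1
    # every value except the missing number and its successor now has count 2
    return min(k for k, c in counts.items() if c != 2)

nums = [n for n in range(1, 100001) if n != 31415]
print(find_missing(nums, 1, 100000))  # 31415
```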

Here is another solution from Ben:

The idea is the same, really, but as you said the sum can overflow or be expensive to compute, so the first step in my approach is to divide the set into many smaller sets, each containing consecutive numbers. If you also design the split so that every set has the same size, then only one of them will have a missing number, and you can identify which one by counting elements.

E.g.: 0 to 2^16-1. Split into 32 partitions by mapping each number (as value) to its leftmost 5 bits (a fast operation) as key. All the numbers sent to the same reducer should be consecutive, so you know how many there should be and what their sum should be. The count tells you which partition contains the missing number, but you still need the sum to know what that number is. You can disregard the leftmost bits completely and carry the sum over only the remaining bits. Not only do you sum just 1/32 of the numbers, you don't even sum all of their bits. If this is still at risk of overflow, you can repeat the divide-and-conquer step over the remaining bits. The only thing you need to store, step after step, is the key (the leftmost bits).
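Here is a minimal local sketch of that split, assuming the range 0..2^16-1 and using plain Python lists in place of the map/reduce machinery (the function and variable names are mine, not from the original):

```python
# Split by the leftmost 5 bits; find the short partition by count,
# then recover the missing number from the sum of the low bits only.
def find_missing_split(nums, total_bits=16, key_bits=5):
    low_bits = total_bits - key_bits
    part_size = 1 << low_bits                 # numbers per partition if none missing
    counts = [0] * (1 << key_bits)
    low_sums = [0] * (1 << key_bits)
    for n in nums:                            # "map": key = leftmost 5 bits
        k = n >> low_bits
        counts[k] += 1
        low_sums[k] += n & (part_size - 1)    # carry only the low bits into the sum
    k = counts.index(part_size - 1)           # the short partition holds the missing number
    expected = part_size * (part_size - 1) // 2   # sum of 0..part_size-1
    return (k << low_bits) | (expected - low_sums[k])

nums = [n for n in range(1 << 16) if n != 40000]
print(find_missing_split(nums))  # 40000
```

Each partition's sum stays small (at most about 2 million here), so the overflow risk disappears even with 32-bit arithmetic.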

I suppose you could even do that recursively, one bit at a time from the left. Shift operators are built into every CPU and are among the fastest operations after bit reads, so you could plausibly solve this no matter how large the file is (as long as it fits on the drive?).

After you split the list into 32 non-overlapping subsets of consecutive numbers, you compute the COUNT in the reducer.
They should all have the same count; the one with one fewer element is the set with the missing number.
This way you compute the potentially overflowing sum on only N/32 of the numbers.
And you don't even sum the whole numbers: since they all share the same key, they all share the same leftmost bits, so you can discard those. You've effectively reduced the problem.
E.g. if you are summing all the numbers between 100000 and 100999, you can sum only 0-999 and add 100000 × count to it.
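That last offset trick can be checked directly. This sketch assumes 100500 as the illustrative missing number:

```python
# Sum only the residuals 0..999; the base offset is restored at the end.
base = 100000
nums = [n for n in range(100000, 101000) if n != 100500]
residual_sum = sum(n - base for n in nums)   # only small numbers are summed
expected = 999 * 1000 // 2                   # sum of 0..999 = 499500
missing = base + (expected - residual_sum)
print(missing)  # 100500
```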