Sampling from a Stream

Selecting a random element from an array of length n is easy: simply generate a random integer i, with 0 <= i < n, and use the array element at that index position. But what if the length of the array is not known beforehand, or is, in fact, infinite (i.e. a stream)? And what if we don’t just want a single element, but a set of m samples, without replacement?
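As a concrete illustration, a minimal Python sketch of the array case (the function name is illustrative):

    import random

    def random_element(items):
        # Pick a uniformly random index 0 <= i < n, then index into the array.
        i = random.randrange(len(items))
        return items[i]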

An Inductive Argument

Imagine we are not able to access elements in the input data randomly, but that we have to access them sequentially — for example because we are dealing with a list, or because we need to apply some sort of filter before making our choice. As we iterate over the elements in the input, at each step we must decide whether to keep the current element as the sample. With what probability should we retain an element?

The answer is easy if we work backwards: imagine we have already consumed the entire input, except for the last element. Since we must select a sample, we have to retain the last element with probability 1. Now back up a step, so that there are two elements left in the input. We must select either one of these, with equal probability; hence we should retain the second-to-last element with probability 1/2. If three elements are left, we should select any one of them with equal probability, leading to a probability of 1/3. In general, we should retain the element with (zero-based) index i with probability $1/(n-i)$. Note that we do need to know the total number of elements n in the input.
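As a minimal sketch, this backward argument translates into the following Python (function name illustrative; note that the input length must be known):

    import random

    def sequential_sample_one(items):
        # Scan the input in order; retain element i (zero-based) with
        # probability 1/(n - i). Once an element is retained, we are done:
        # on the last element (i = n-1), the retention probability is 1.
        n = len(items)
        for i, item in enumerate(items):
            if random.randrange(n - i) == 0:  # true with probability 1/(n - i)
                return item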

The important take-away here is the nature of the argument: starting at the end, and working backwards. This idea generalizes to more complicated situations.

Sampling Without Replacement

Imagine that we would like to draw m samples from an input of n elements without replacement, meaning that each input element may only be selected once. One way to do this is to remove each selected element from the array, but this destroys the array, and is slow (because of the need to copy elements of the input array as the array shrinks). Another destructive, but potentially slightly faster method would be to perform a partial random shuffle of the array and use its first m elements.

To find the correct non-destructive, iterative algorithm unfortunately requires some combinatorial calisthenics, although the result is pleasingly simple.

Consider an array of n elements, from which we want to take a sample of m items. The number of distinct samples of m elements from a population of n is given by the binomial coefficient:

\[ \binom{n}{m} = \frac{n!}{m! (n-m)!} \]

Now consider the case that we do not select the first element in the array. In this case, we still must select m elements, but from a smaller population of n-1 items. The number of such samples is $\binom{n-1}{m}$.

We can now ask: what is the probability that a sample of size m, taken from an array of n elements, does not contain the first element? The probability is the number of favorable outcomes, divided by the total number of possible outcomes:

\[ q = \frac{\text{number of samples without first element}}{\text{total number of samples}} = \frac{ \binom{n-1}{m} }{ \binom{n}{m} } = \frac{n-m}{n} = 1 - \frac{m}{n} \]

This is the probability that a sample does not contain the first element. The probability that a sample does contain the first element, which is the probability with which the first element should be retained, is therefore:

\[ p = 1-q = \frac{m}{n} \]
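As a quick numerical check, take n = 5 and m = 2: there are $\binom{5}{2} = 10$ possible samples, of which $\binom{4}{2} = 6$ avoid the first element, so $q = 6/10 = 3/5 = (n-m)/n$ and $p = 2/5 = m/n$, as expected.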

This formula applies “recursively” at each step, where n and m always indicate the number of remaining array elements and the number of samples still to draw. In an iterative implementation, the probability would be $p = (m-k)/(n-i)$, where m and n are now kept constant, while i signifies the index of the current element, and k counts the number of elements already selected for the sample. Notice that the formula (together with its derivation) reduces to the previous result for a sample size of 1.
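Putting this together gives a one-pass sampling routine, often called selection sampling (a minimal Python sketch, with an illustrative function name):

    import random

    def sequential_sample(items, m):
        # Single-pass sampling without replacement from an input of known
        # length n: retain element i with probability (m - k)/(n - i),
        # where k counts the elements already selected.
        n = len(items)
        sample = []
        for i, item in enumerate(items):
            if random.randrange(n - i) < m - len(sample):  # prob (m - k)/(n - i)
                sample.append(item)
                if len(sample) == m:
                    break
        return sample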

Stream Sampling

The algorithms introduced so far have one thing in common: they require the size of the input to be known ahead of time. But what if the size is unknown, or is effectively unlimited, for example when reading from a stream?

Let’s say that we want to select one item from an (in principle unlimited) stream. We have no information about the number of elements in the stream; it may in fact be quite short. Yet we want an algorithm that is guaranteed to produce a “best guess” single-item sample from the stream at all times.

The inductive argument presented earlier carries over to the present case, except that now we start counting at the beginning. We retain the first element read from the stream with probability 1: this way, we are guaranteed to have a sample. When we read the second element, we retain it with probability 1/2, and — if we retain it — use it to replace the previous sample. When the third element is read, we have seen a total of 3 elements to choose from, and hence we retain it with probability 1/3 and use it to replace whatever is the current sample. And so on.
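In code, the single-item stream sampler might look like this (a Python sketch; function name illustrative):

    import random

    def stream_sample_one(stream):
        # Replace the current sample with element i (zero-based) with
        # probability 1/(i + 1); the first element is always retained.
        sample = None
        for i, item in enumerate(stream):
            if random.randrange(i + 1) == 0:  # true with probability 1/(i + 1)
                sample = item
        return sample

At every point during the iteration, sample holds a valid single-item sample of the stream consumed so far.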

Reservoir Sampling

Reservoir sampling is a generalization of the previous stream sampling algorithm to samples of more than one element. A naive generalization might look like this: fill the “reservoir” (or sample) with the first m elements read from the stream; then retain each newly read element with probability m/(i+1), and use it to replace a randomly selected item in the reservoir. (Using a zero-based index i to count the elements read from the stream.) The problem with this naive algorithm is that it requires two random numbers at each step: one to decide whether to retain an element, the other to determine which element in the reservoir to replace.
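For concreteness, a Python sketch of the naive scheme (function name illustrative):

    import random

    def naive_reservoir_sample(stream, m):
        # Two random numbers per element: one to decide retention (with
        # probability m/(i + 1)), one to choose the reservoir slot to replace.
        reservoir = []
        for i, item in enumerate(stream):
            if i < m:
                reservoir.append(item)                 # fill the reservoir first
            elif random.random() < m / (i + 1):        # first random number
                reservoir[random.randrange(m)] = item  # second random number
        return reservoir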

Reservoir sampling makes do with a single random number, as follows. When reading the element with index i, generate a random integer r from the range 0 <= r < i+1. (In other words, if all the i+1 elements read so far from the stream had been placed into an array, generate a random index into that array.) If the random integer r obeys 0 <= r < m, that is, if r identifies an element in the (zero-based) reservoir, then retain the current value read from the stream and place it into the reservoir at position r; otherwise, do not retain the item read from the stream.
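A Python sketch of this single-random-number version (function name illustrative):

    import random

    def reservoir_sample(stream, m):
        # One random number per element: r is a random index into the
        # i + 1 elements seen so far; if it falls inside the reservoir,
        # replace the reservoir element at that position.
        reservoir = []
        for i, item in enumerate(stream):
            if i < m:
                reservoir.append(item)           # fill the reservoir first
            else:
                r = random.randrange(i + 1)      # 0 <= r < i + 1
                if r < m:
                    reservoir[r] = item          # retain, replacing slot r
        return reservoir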

In the same way that single-item stream sampling generalizes our original inductive argument, reservoir sampling applies the same logic to samples of multiple elements.