Sampling from a Stream
Selecting a random element from an array of length n
is easy: simply
generate a random integer i
, with 0 <= i < n
, and use the array
element at that index position. But what if the length of the array is
not known beforehand, or is, in fact, infinite (i.e. a stream)? And what
if we don’t just want a single element, but a set of m
samples,
without replacement?
An Inductive Argument
Imagine we are not able to access elements in the input data randomly, but that we have to access them sequentially — for example because we are dealing with a list, or because we need to apply some sort of filter before making our choice. As we iterate over the elements in the input, at each step we must decide whether to keep the current element as the sample. With what probability should we retain an element?
The answer is easy if we work backwards: imagine we have already consumed
the entire input, except for the last element. Since we must select a
sample, we have to retain the last element with probability 1. Now back
up a step, so that there are two elements left in the input. We must select
either one of these, with equal probability; hence we should retain the
second-to-last element with probability 1/2. If three elements are left,
we should select any one of them with equal probability, leading to a
probability of 1/3. In general, we should retain the element with
index i
with probability $1/(n-i)$. Note that we do need to know
the total number of elements n
in the sample.
The important take-away here is the nature of the argument: starting at the end, and working backwards. This idea generalizes to more complicated situations.
Sampling Without Replacement
Imagine that we would like to draw m
samples from an input of n
elements without replacement, meaning that each input element may
only be selected once. One way to do this is to remove each selected
element from the array, but this destroys the array, and is slow
(because of the need to copy elements the input array as the array
shrinks). Another destructive, but potentially slightly faster method
would be to perform a random shuffle of m
elements and use them.
To find the correct non-destructive, iterative algorithm unfortunately requires some combinatorial calisthenics, although the result is pleasingly simple.
Consider an array of n
elements, from which we want to take a sample
of m
items. The number of distinct samples of m
elements from a
population of n
is given by the binominal coefficient:
\[ \binom{n}{m} = \frac{n!}{m! (n-m)!} \]
Now consider the case that we do not select the first element in the
array. In this case, we still must select m
elements, but from a
smaller population of n-1
items. The number of such samples is
$\binom{n-1}{m}$.
We can now ask: what is the probability that a sample of size m
,
taken from an array of n
elements, does not contain the first element?
The probability is the number of favorable outcomes, divided by the
total number of possible outcomes:
\[ q = \frac{\text{number of samples without first element}}{\text{total number of samples}} = \frac{ \binom{n-1}{m} }{ \binom{n}{m} } = \frac{n-m}{n} = 1 - \frac{m}{n} \]
This is the probability that a sample does not contain the first element. The probability that a sample does contain the first element, which is the probability with which the first element should be retained, is therefore:
\[ p = 1-q = \frac{m}{n} \]
This formula applies “recursively” at each step, where n
and m
always indicate the number of remaining array elements and the number
of samples still to draw. In an iterative implementation, the probability
would be $p = (m-k)/(n-i)$, where m
and n
are now kept constant, while
i
signifies the index of the current element, and k
counts the number
of elements already selected for the sample. Notice that the formula
(together with its derivation) reduces to the previous result for a
sample size of 1.
Stream Sampling
What has been common to the algorithms introduced so far is that they required the size of the input to be known ahead of time. But what if the size is unknown, or is effectively unlimited, for example when reading from a stream?
Let’s say that we want to select one item from an (in principle unlimited) stream. We have no information of the number of elements in the stream; it may in fact be quite short. Yet, we want an algorithm that is guaranteed to produce a “best guess” single-item sample from the stream at all times.
The inductive argument presented earlier carries over the present case, only that now we start counting at the beginning. We retain the first element read from the stream with probability 1: this way, we are guaranteed to have a sample. When we read the second element, we retain it with probability 1/2, and — if we retain it — use it to replace the previous sample. When the third element is read, we have seen a total of 3 elements to choose from, and hence we retain it with probability 1/3 and use it to replace whatever is the current sample. And so on.
Reservoir Sampling
Reservoir sampling is a generalization of the previous stream sampling
algorithm to samples of more than one element. A naive generalization
might look like this: fill the “reservoir” (or sample) with the first
m
elements read from the stream; then retain each newly read element
with probability 1/(i+1), and use it to replace a randomly selected item
in the reservoir. (Using a zero-based index i
to count the elements
read from the stream.) The problem with this naive algorithm is that it
requires two random numbers at each step: one to decide whether to
retain an element, the other to determine which element in the sample
to replace.
Reservoir sampling makes do with a single random number, as follows.
When reading the element with index i
, generate a random integer
r
from the range 0 <= r < i
. (In other words, if all the elements
read so far from the stream had been placed into an array, generate a
random index into that array.) If the random integer r
obeys
0 <= r < m
, that is, if r
identifies a element in the (zero-based)
reservoir, then retain the current value read from the stream and place
it into the reservoir at position r
; otherwise, do not retain the
item read from the stream.
In the same way that single-item stream sampling generalizes our original inductive argument, reservoir sampling applies the same logic to samples of multiple elements.