randomness testing using data compressors over fixed-size alphabets
israndom [-a alphasize] [-c compressor] [-s samplelen] [-qhnr] [filename]
israndom tests a sequence of symbols for randomness. It tries to determine whether a given sequence of trials could reasonably be assumed to come from a uniform random distribution over a fixed-size alphabet of 2-256 symbols.
- israndom assumes that each symbol (or sample trial) is represented by exactly one byte. The only exceptions to this rule are the -n and -r
options, which ignore newlines and carriage returns, respectively (see below).
- israndom is based on the mathematical ideas of Shannon, Kolmogorov, and Cilibrasi and uses the following formula to determine an expected size for a sample of n trials of a uniform distribution over an alphabet of size a. Each symbol takes log2(a) bits, so the total cost (in bits) c for the ensemble of samples is c = n * log2(a) bits. This number is rounded up to the nearest byte and increased by one to arrive at the final estimate of the expected communication cost on the assumption of uniform randomness.
- If the compressed size of the samples is less than this expected size, then this represents a randomness deficiency and the randomness test fails. israndom will exit with a nonzero exit status. If israndom indicates that a source is nonrandom, this fact is effectively certain, provided the compression module is correct and invertible. If the compressed size is at least the threshold value, then the file appears to be random and passes the test, and israndom will exit with a 0 return value. In either case, it will print the alphabet size, expected compressed size, sample count, and randomness difference before exiting with an appropriate return code.
- The default number of samples is 393216. Larger sample counts should increase accuracy, while too few samples leave the method unable to resolve non-randomness in certain situations. This is a theoretically unavoidable fact for all effective randomness tests.
- If a filename is given, it is read to find the samples to analyze. If the filename "-" is given, or no filename is given at all, then israndom reads from standard input.
- If text files are to be used, it is important to specify one or both of -n and -r since without these, end of line characters will be misinterpreted as samples.
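The size estimate and pass/fail rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Python's bz2 module stands in for the bzlib compression module, and the helper names expected_size_bytes and looks_random are hypothetical, not part of israndom.

```python
import bz2
import math
import os


def expected_size_bytes(n: int, a: int) -> int:
    """Expected size, in bytes, of n uniform samples over an alphabet of size a.

    Each symbol costs log2(a) bits; the total is rounded up to a whole
    byte and then increased by one, as the description above states.
    """
    total_bits = n * math.log2(a)
    return math.ceil(total_bits / 8) + 1


def looks_random(samples: bytes, a: int) -> bool:
    """Return True if the compressed size reaches the expected size.

    ASSUMPTION: bz2.compress is used here as a stand-in compressor; the
    real tool may frame or measure its compressed output differently.
    """
    threshold = expected_size_bytes(len(samples), a)
    compressed_size = len(bz2.compress(samples))
    return compressed_size >= threshold


if __name__ == "__main__":
    # One byte per sample, full 256-symbol alphabet.
    print(expected_size_bytes(393216, 256))
    print(looks_random(os.urandom(100000), 256))
    print(looks_random(b"\x00" * 100000, 256))
```

A constant input compresses far below the expected size and fails the test, while output from a good generator does not compress and passes.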
- -c compressor_name
set compressor explicitly to compressor_name instead of the default, bzlib. For basic analysis, bzlib is usually sufficient. For detecting complex or subtle biases, a more powerful compression module such as lzma (lzmax) or ppmd (ppmdx) can detect more types of non-randomness. Because Lempel-Ziv type compressors are universal, all effective randomness tests can be captured as a kind of compression discriminant function.
- -n
ignore newlines (so that text files may be used)
- -r
ignore carriage returns (so that text files may be used)
- -a alphasize
set alphabet size to alphasize, an integer between 2 and 256. If you do not specify an alphabet size, it is automatically determined by the contents of the samples.
- -s samplecount
Use samplecount samples instead of the default of 393216. Using a number that is too small here will reduce the accuracy of the test, causing everything to appear to be random. If 0 is used, it means to read until EOF.
- -q
quiet mode, with no extra status messages
- -h
print help and exit.
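The effect of the newline, carriage return, and sample count options on the input can be pictured with the following Python sketch. It is not israndom's code: the function names are hypothetical, and the alphabet size rule (the number of distinct byte values, with a minimum of 2) is an assumption, since the text above only says the size is determined by the contents of the samples.

```python
def prepare_samples(data: bytes,
                    ignore_newlines: bool = False,
                    ignore_carriage_returns: bool = False,
                    sample_count: int = 393216) -> bytes:
    """Drop ignored bytes, then keep at most sample_count samples.

    A sample_count of 0 means "use everything up to EOF".
    """
    drop = b""
    if ignore_newlines:
        drop += b"\n"
    if ignore_carriage_returns:
        drop += b"\r"
    # bytes.translate with a None table only deletes the listed bytes.
    filtered = data.translate(None, drop)
    if sample_count == 0:
        return filtered
    return filtered[:sample_count]


def guess_alphabet_size(samples: bytes) -> int:
    """ASSUMPTION: alphabet size = number of distinct byte values seen."""
    return max(2, len(set(samples)))
```

With both filters enabled, the text input b"01\n10\r\n" reduces to the four samples b"0110", and this sketch would guess an alphabet size of 2.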
First, we can verify that the cryptographically strong random number generator passes the test:
- israndom /dev/urandom
- Next, we can notice that the "od" command, without extra options, is not random because it prints out addresses and spaces predictably. Most compressors can tell by the regular spaces that it is not random:
- od /dev/urandom | israndom -n -r
- but if we remove spaces using 'tr' then a more powerful compressor, lzmax, is required to demonstrate the non-randomness of the sequence:
- od /dev/urandom | tr -d ' ' | israndom -n -r -c lzmax
- Removing the address lines using the -An
option yields the expected result once again that the sequence is effectively random:
- od -An /dev/urandom | tr -d ' ' | israndom -n -r -c lzmax
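- The above sequence is not actually random, because the leading digit of each
octal group is restricted. With one-byte groups, every third digit only ranges
from 0 to 3, since the largest byte value is 377 octal (255 decimal); with od's
default two-byte groups, every sixth digit is only 0 or 1. This
subtle pattern is detectable using 10 million samples and the advanced
ppmdx compressor: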
- od -An /dev/urandom | tr -d ' ' | israndom -n -r -c ppmdx -s 10000000
- As a sanity check, we see that even in extreme analysis as above, /dev/urandom
still checks out okay as random, even with newlines and carriage returns
removed for good measure.
- cat /dev/urandom | israndom -n -r -c ppmdx -s 10000000
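The restricted digit range mentioned in the od example above can be checked directly. The short Python sketch below enumerates every value of a given width and collects the leading octal digit; the helper name leading_octal_digits is hypothetical, and the two-byte case assumes od's default output format of two-byte octal words.

```python
def leading_octal_digits(width_bytes: int) -> set:
    """Set of leading digits over all zero-padded octal values of this width."""
    max_value = 256 ** width_bytes - 1
    digits = len(format(max_value, "o"))  # field width used for padding
    return {format(v, "o").zfill(digits)[0] for v in range(max_value + 1)}


if __name__ == "__main__":
    print(0o377)                             # largest one-byte value
    print(sorted(leading_octal_digits(1)))   # one-byte groups
    print(sorted(leading_octal_digits(2)))   # two-byte groups
```

Because the leading digit never uses the full range 0 to 7, a sufficiently strong compressor given enough samples can exploit the pattern.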
No environment variables.
Please report bugs to the Debian BTS.