man israndom (1): randomness testing using data compressors over fixed-size alphabets

SYNOPSIS

israndom [-a alphasize] [-c compressor] [-s samplelen] [-qhnr] [filename]

DESCRIPTION

israndom tests a sequence of symbols for randomness. israndom tries to determine if a given sequence of trials could reasonably be assumed to be from a random uniform distribution over a fixed-size alphabet of 2-256 symbols.

israndom assumes that each symbol (or sample trial) is represented by exactly one byte. The only exceptions to this rule are the -n and -r options, which ignore newlines and carriage returns, respectively (see below).
israndom is based on the mathematical ideas of Shannon, Kolmogorov, and Cilibrasi. It uses the following formula to determine an expected size for a sample of k trials of a uniform distribution over an alphasize-symbol alphabet. Each symbol takes log2(alphasize) bits, so the total cost c for the ensemble of samples is k log2(alphasize) bits. This number is converted to bytes, rounded up to the nearest byte, and increased by one to arrive at the final estimate of the expected communication cost on the assumption of uniform randomness.
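The estimate described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not israndom's own source: the function name expected_size_bytes is hypothetical, and the logarithm is assumed to be base 2 because the cost is stated in bits.

```python
import math

def expected_size_bytes(k: int, alphasize: int) -> int:
    """Expected communication cost, in bytes, of k uniform samples."""
    if not 2 <= alphasize <= 256:
        raise ValueError("alphasize must be between 2 and 256")
    bits = k * math.log2(alphasize)   # c = k * log2(alphasize) bits
    return math.ceil(bits / 8) + 1    # round up to whole bytes, then add one

print(expected_size_bytes(393216, 256))  # 393217
print(expected_size_bytes(393216, 2))    # 49153
```

With the default sample count and a full 256-symbol alphabet, the threshold is one byte more than the raw input size.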
If the compressed size of k samples is less than c, this represents a randomness deficiency and the randomness test fails; israndom will exit with a nonzero exit status. If israndom indicates that a source is nonrandom, this fact is effectively certain provided the compression module is correct and invertible. If the compressed size is at least the threshold value c, the file appears to be random and passes the test, and israndom will exit with a 0 return value. In either case, it will print the alphabet size, expected compressed size, sample count, and randomness difference before exiting with the appropriate return code.
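The pass/fail decision can be approximated with Python's standard bz2 module. This is a minimal sketch under stated assumptions: looks_random is a hypothetical name, the randomness difference is assumed to be the compressed size minus the expected size, and the framing overhead of Python's bz2 output may differ from that of israndom's bzlib compression module.

```python
import bz2
import math
import os

def looks_random(samples: bytes, alphasize: int) -> tuple:
    """Return (passed, difference) for one-byte-per-symbol samples."""
    k = len(samples)
    # Threshold: k * log2(alphasize) bits, rounded up to bytes, plus one.
    expected = math.ceil(k * math.log2(alphasize) / 8) + 1
    compressed = len(bz2.compress(samples))
    # Assumption: the difference is compressed size minus expected size.
    difference = compressed - expected
    return difference >= 0, difference

print(looks_random(os.urandom(393216), 256)[0])  # incompressible input passes
print(looks_random(b"\x00" * 393216, 256)[0])    # constant input fails
```

A negative difference means the compressor found a description shorter than the uniform-randomness estimate, which is the randomness deficiency described above.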
The default number of samples is 393216. Larger sample counts should increase accuracy, whereas using too few samples will leave the method unable to resolve randomness in certain situations. This is a theoretically unavoidable fact for all effective randomness tests.
If a filename is given, it is read to find the samples to analyze. If the filename "-" is given, or no filename is given at all, then israndom reads from standard input.
If text files are to be used, it is important to specify one or both of -n and -r, since without them end-of-line characters will be misinterpreted as samples.

OPTIONS

-c compressor_name: set compressor explicitly to compressor_name instead of the default, bzlib. For basic analysis, bzlib is usually sufficient. For detecting complex or subtle biases, a more powerful compression module such as lzma (lzmax) or ppmd (ppmdx) will detect more types of non-randomness. Because Lempel-Ziv type compressors are universal, all effective randomness tests can be captured as a kind of compression discriminant function.
-n: ignore newlines (so that text files may be used)
-r: ignore carriage returns (so that text files may be used)
-a alphasize: set alphabet size to alphasize, an integer between 2 and 256. If you do not specify an alphabet size, it is automatically determined by the contents of the samples.
-s samplecount: Use samplecount samples instead of the default of 393216. Using a number that is too small here will reduce the accuracy of the test, causing everything to appear to be random. If 0 is used, it means to read until EOF.
-q: quiet mode, with no extra status messages
-h: print help and exit.
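The effect of -n, -r, and -s on the input can be pictured as a small preprocessing step. The sketch below is illustrative only: select_samples and its parameter names are hypothetical, and it assumes that ignored characters are dropped before samples are counted, which the text above does not state explicitly.

```python
def select_samples(raw: bytes, ignore_newlines: bool = False,
                   ignore_cr: bool = False, samplecount: int = 393216) -> bytes:
    """Drop ignored end-of-line bytes, then keep at most samplecount samples."""
    if ignore_newlines:              # corresponds to -n
        raw = raw.replace(b"\n", b"")
    if ignore_cr:                    # corresponds to -r
        raw = raw.replace(b"\r", b"")
    if samplecount == 0:             # -s 0: read everything up to EOF
        return raw
    return raw[:samplecount]

print(select_samples(b"12\r\n34\n", True, True, 3))  # b'123'
```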

EXAMPLES

First, we can verify that the cryptographically strong random number generator is correct:

israndom /dev/urandom
Next, we can notice that the "od" command, without extra options, is not random because it prints out addresses and spaces predictably. Most compressors can tell by the regular spaces that it is not random:
od /dev/urandom | israndom -n -r
but if we remove spaces using 'tr' then a more powerful compressor, lzmax, is required to demonstrate the non-randomness of the sequence:
od /dev/urandom | tr -d ' ' | israndom -n -r -c lzmax
Removing the address lines using an od option yields the expected result once again that the sequence is effectively random:
od -An /dev/urandom | tr -d ' ' | israndom -n -r -c lzmax
The above sequence is still not actually random, because every third octal digit only ranges from 0 to 3, since the largest byte value, 377 octal, is 255 decimal. This subtle pattern is detectable using 10 million samples and the advanced ppmdx compressor:
od -An /dev/urandom | tr -d ' ' | israndom -n -r -c ppmdx -s 10000000
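The size of this bias can be worked out directly. The sketch below assumes each byte is printed as its own three-digit octal group (as one-byte octal output such as od -b produces); od's default output may group bytes differently, so treat the numbers as an illustration of the argument rather than a model of the exact command above.

```python
import math

# Every possible byte value, written as a three-digit octal group.
groups = [format(b, "03o") for b in range(256)]

# The leading digit of each group never exceeds 3 (377 octal = 255 decimal).
leading = sorted({g[0] for g in groups})
print(leading)  # ['0', '1', '2', '3']

# Under the uniform assumption, three octal digits cost 3 * log2(8) = 9 bits,
# but they only carry the 8 bits of the underlying byte.
bits_assumed = 3 * math.log2(8)
bits_actual = 8
ratio = bits_actual / bits_assumed
print(ratio)
```

The ratio is roughly 0.89, so an ideal compressor could save about one bit in nine, a deficiency small enough that a large sample count and a strong compressor are needed to expose it.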
As a sanity check, we see that even in extreme analysis as above, /dev/urandom still checks out okay as random, even with newlines and carriage returns removed for good measure:
cat /dev/urandom | israndom -n -r -c ppmdx -s 10000000

ENVIRONMENT

No environment variables.

BUGS

Please report bugs to the Debian BTS.

AUTHOR

Rudi Cilibrasi