SYNOPSIS
datamash [OPTION] op [col] [op col ...]DESCRIPTION
Performs numeric/string operations on input from stdin.'op' is the operation to perform; For grouping,per-line operations 'col' is the input field to use; 'col' can be a number (1=first field), or a column name when using -H or --header-in options.
File operations:
- transpose, reverse
Line-Filtering operations:
- rmdup
Per-Line operations:
- base64, debase64, md5, sha1, sha256, sha512
Numeric Grouping operations:
- sum, min, max, absmin, absmax
Textual/Numeric Grouping operations:
- count, first, last, rand, unique, collapse, countunique
Statistical Grouping operations:
- mean, median, q1, q3, iqr, mode, antimode, pstdev, sstdev, pvar svar, mad, madraw, pskew, sskew, pkurt, skurt, dpo, jarque
OPTIONS
Grouping Options:
- -f, --full
- print entire input line before op results (default: print only the grouped keys)
- -g, --group=X[,Y,Z]
- group via fields X,[Y,Z]
- --header-in
- first input line is column headers
- --header-out
- print column headers as first line
- -H, --headers
- same as '--header-in --header-out'
- -i, --ignore-case
- ignore upper/lower case when comparing text; this affects grouping, and string operations
- -s, --sort
- sort the input before grouping; this removes the need to manually pipe the input through 'sort'
File Operation Options:
- --no-strict
- allow lines with varying number of fields
- --filler=X
- fill missing values with X (default %s)
General Options:
- -t, --field-separator=X
- use X instead of TAB as field delimiter
- --narm
- skip NA/NaN values
- -W, --whitespace
- use whitespace (one or more spaces and/or tabs) for field delimiters
- -z, --zero-terminated
- end lines with 0 byte, not newline
- --help
- display this help and exit
- --version
- output version information and exit
AVAILABLE OPERATIONS
File operations:
- transpose
- transpose rows, columns of the input file
- reverse
- reverse field order in each line
Line-Filtering operations:
- rmdup
- remove lines with duplicated key value
Per-Line operations:
- base64
- Encode the field as base64
- debase64
- Decode the field as base64, exit with error if invalid base64 string
- md5/sha1/sha256/sha512
- Calculate md5/sha1/sha256/sha512 hash of the field value
- reverse
- reverse field order in each line
Numeric Grouping operations
- sum
- sum the of values
- min
- minimum value
- max
- maximum value
- absmin
- minimum of the absolute values
- absmax
- maximum of the absolute values
Textual/Numeric Grouping operations
- count
- count number of elements in the group
- first
- the first value of the group
- last
- the last value of the group
- rand
- one random value from the group
- unique
- comma-separated sorted list of unique values
- collapse
- comma-separated list of all input values
- countunique
- number of unique/distinct values
Statistical Grouping operations
- mean
- mean of the values
- median
- median value
- q1
- 1st quartile value
- q3
- 3rd quartile value
- iqr
- inter-quartile range
- mode
- mode value (most common value)
- antimode
- anti-mode value (least common value)
- pstdev
- population standard deviation
- sstdev
- sample standard deviation
- pvar
- population variance
- svar
- sample variance
- mad
- median absolute deviation, scaled by constant 1.4826 for normal distributions
- madraw
- median absolute deviation, unscaled
- sskew
- skewness of the (sample) group
- pskew
-
skewness of the (population) group
values x reported by 'sskew' and 'pskew' operations:x > 0 - positively skewed / skewed right 0 > x - negatively skewed / skewed left x > 1 - highly skewed right 1 > x > 0.5 - moderately skewed right 0.5 > x > -0.5 - approximately symmetric -0.5 > x > -1 - moderately skewed left -1 > x - highly skewed left
- skurt
- excess Kurtosis of the (sample) group
- pkurt
- excess Kurtosis of the (population) group
- jarque
- p-value of the Jarque-Beta test for normality
- dpo
-
p-value of the D'Agostino-Pearson Omnibus test for normality;
for 'jarque' and 'dpo' operations:
null hypothesis is normality;
low p-Values indicate non-normal data;
high p-Values indicate null-hypothesis cannot be rejected.
EXAMPLES
Print the sum and the mean of values from column 1:
- $ seq 10 | datamash sum 1 mean 1 55 5.5
Group input based on field 1, and sum values (per group) on field 2:
- $ cat example.txt A 10 A 5 B 9 B 11 $ datamash -g 1 sum 2 < example.txt A 15 B 20
Unsorted input must be sorted (with '-s'):
- $ cat example.txt A 10 C 4 B 9 C 1 A 5 B 11 $ datamash -s -g1 sum 2 < example.txt A 15 B 20 C 5
Which is equivalent to:
- $ cat example.txt | sort -k1,1 | datamash -g 1 sum 2
Use -h (--headers) if the input file has a header line:
- # Given a file with student name, field, test score... $ head -n5 scores_h.txt Name Major Score Shawn Engineering 47 Caleb Business 87 Christian Business 88 Derek Arts 60 # Calculate the mean and standard devian for each major $ datamash --sort --headers --group 2 mean 3 pstdev 3 < scores_h.txt (or use short form) $ datamash -sH -g2 mean 3 pstdev 3 < scores_h.txt (or use named columns) $ datamash -sH -g Major mean Score pstdev Score < scores_h.txt GroupBy(Major) mean(Score) pstdev(Score) Arts 68.9 10.1 Business 87.3 4.9 Engineering 66.5 19.1 Health-Medicine 90.6 8.8 Life-Sciences 55.3 19.7 Social-Sciences 60.2 16.6
Reverse field order in each line:
- $ seq 6 | paste - - | datamash reverse 2 1 4 3 6 5
Transpose rows, columns:
- $ seq 6 | paste - - | datamash transpose 1 3 5 2 4 6
Remove lines with duplicate key value from column 1 (Unlike first,last operations, rmdup is much faster and does not require sorting the file with -s):
- # Given a list of files and sample IDs: $ cat INPUT SampleID File 2 cc.txt 3 dd.txt 1 ab.txt 2 ee.txt 3 ff.txt # Remove lines with duplicated Sample-ID (column 1): $ datamash rmdup 1 < INPUT (or used named column) $ datamash -H rmdup SampleID < INPUT SampleID File 2 cc.txt 3 dd.txt 1 ab.txt
Calculate the sha1 hash value of each TXT file, after calculating the sha1 value of each file's content:
- $ sha1sum *.txt | datamash -Wf sha1 2
ADDITIONAL INFORMATION
See GNU Datamash Website (http://www.gnu.org/software/datamash)AUTHOR
Written by Assaf Gordon.COPYRIGHT
Copyright © 2015 Assaf Gordon License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.