File: datamash.texi

\input texinfo @c -*-texinfo-*-
@c %**start of header
@setfilename datamash.info
@include version.texi
@settitle GNU Datamash @value{VERSION}

@c Define a new index for options.
@defcodeindex op
@c Combine everything into one index (arbitrarily chosen to be the
@c concept index).
@syncodeindex op cp
@syncodeindex vr cp
@c %**end of header

@copying
This manual is for GNU Datamash (version @value{VERSION}, @value{UPDATED}),
which provides command-line computations on input files.

Copyright @copyright{} 2014--2021 Assaf Gordon.
Copyright @copyright{} 2022--2025 Timothy Rice.

@quotation
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with no
Invariant Sections, no Front-Cover Texts, and no Back-Cover
Texts.  A copy of the license is included in the section entitled
``GNU Free Documentation License''.
@end quotation
@end copying
@c If your manual is published on paper by the FSF, it should include
@c the standard FSF Front-Cover and Back-Cover Texts, as given in
@c maintain.texi.

@dircategory Basics
@direntry
* Datamash: (datamash).               Command-line computations on input files.
@end direntry

@titlepage
@title GNU Datamash
@subtitle for version @value{VERSION}, @value{UPDATED}
@author GNU Datamash Developers (@email{assafgordon@@gmail.com})
@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage

@contents


@ifnottex
@node Top
@top Datamash

This manual is for GNU Datamash (version @value{VERSION}, @value{UPDATED}),
which provides command-line computations on input files.
@end ifnottex

@menu
* Overview::		General purpose and information.
* Invoking datamash::	How to run @command{datamash}.
* Available Operations::        Available operations in @command{datamash}.
* Statistical Operations::	Statistical operations in @command{datamash}.
* Usage Examples::      Usage Examples.
* Reporting bugs::	Sending bug reports and feature suggestions.
* GNU Free Documentation License:: Copying and sharing this documentation.
* Concept index::	Index of concepts.


@end menu

@node Overview
@chapter Overview



@cindex overview

The @command{datamash} program
(@url{https://www.gnu.org/software/datamash}) performs calculations (e.g.
@emph{sum}, @emph{count}, @emph{min}, @emph{max}, @emph{skewness},
@emph{standard deviation}) on input files.

Example: sum up the values in the first column of the input:

@example
@cindex example, sum
$ seq 10 | datamash sum 1
55
@end example

@command{datamash} can group input data and perform operations on each group.
It can sort the file and read header lines.

Example: Given a file with three fields (name, subject, score),
find the average score in each subject:

@example
$ cat scores.txt
Name        Subject          Score
Bryan       Arts             68
Isaiah      Arts             80
Gabriel     Health-Medicine  100
Tysza       Business         92
Zackery     Engineering      54
...

@cindex sorting
@cindex grouping
@cindex example, sorting
@cindex example, grouping
$ datamash --sort --headers --group 2 mean 3 sstdev 3 < scores.txt
GroupBy(Subject)   mean(Score)   sstdev(Score)
Arts               68.9474       10.4215
Business           87.3636       5.18214
Engineering        66.5385       19.8814
Health-Medicine    90.6154       9.22441
Life-Sciences      55.3333       20.606
Social-Sciences    60.2667       17.2273
@end example


@command{datamash} is designed for interactive exploration of textual data
and for automating tasks in shell scripts.

@command{datamash} has a rich set of statistical functions to quickly assess
information in textual input files. An example of calculating basic statistics
(mean, 1st quartile, median, 3rd quartile, IQR, sample standard deviation,
and p-value of the Jarque-Bera test for normal distribution):

@cindex example, statistics
@example
$ datamash -H mean 1 q1 1 median 1 q3 1 iqr 1 sstdev 1 jarque 1 < FILE
mean(x)   q1(x)  median(x)  q3(x)   iqr(x)  sstdev(x)  jarque(x)
45.32     23     37         61.5    38.5    30.4487    8.0113e-09
@end example



@node Invoking datamash
@chapter Invoking @command{datamash}

@cindex invoking
@cindex options
@cindex usage
@cindex help

The format for running the @command{datamash} program is:

@example
datamash [@var{option}]@dots{} @var{op1} @var{column1} @
[@var{op2} @var{column2} @dots{}]
@end example

Where @var{op1} is the operation to perform on the values in @var{column1}.
@command{datamash} reads input from stdin and performs one or more operations
on the input data. If @option{--group} is used, each operation is performed
on every group. If @option{--group} is not used, each operation is performed on
all the values in the input file.

@vindex LC_NUMERIC
The @env{LC_NUMERIC} locale specifies the decimal-point character and the
thousands separator.

@exdent @command{datamash} supports the following operations:

@table @asis
@item Primary operations:
@code{groupby}, @code{crosstab}, @code{transpose}, @code{reverse},
@code{check}

@item Line-Filtering operations:
@code{rmdup}

@item Per-Line operations:
@code{base64}, @code{debase64}, @code{md5}, @code{sha1},
@code{sha224}, @code{sha256}, @code{sha384}, @code{sha512}, @code{bin},
@code{strbin}, @code{round}, @code{floor}, @code{ceil}, @code{trunc},
@code{frac}, @code{dirname}, @code{basename}, @code{extname}, @code{barename},
@code{getnum}, @code{cut}, @code{echo}

@item Group-by Numeric operations:
@code{sum}, @code{min}, @code{max}, @code{absmin}, @code{absmax}, @code{range}

@item Group-by Textual/Numeric operations:
@code{count}, @code{first}, @code{last}, @code{rand},
@code{unique}, @code{uniq},
@code{collapse}, @code{countunique}

@item Group-by Statistical operations:
@code{mean}, @code{geomean}, @code{harmmean}, @code{mode},
@code{median}, @code{q1}, @code{q3}, @code{iqr}, @code{perc},
@code{antimode}, @code{pstdev}, @code{sstdev}, @code{pvar}, @code{svar},
@code{ms}, @code{rms}, @code{mad}, @code{madraw}, @code{sskew},
@code{pskew}, @code{skurt}, @code{pkurt}, @code{jarque}, @code{dpo},
@code{scov}, @code{pcov}, @code{spearson}, @code{ppearson},
@code{dotprod}

@end table

@exdent Grouping options:

@table @option
@item --skip-comments
@itemx -C
@opindex --skip-comments
@opindex -C
Skip comment lines (starting with '#' or ';' and optional whitespace).

@item --full
@itemx -f
@opindex --full
@opindex -f
Print entire input line before op results (default: print only the grouped
keys).
While using this option with non-linewise operations was historically permitted,
it never produced very sensible output. Such usage has been deprecated, and in a
future release it will result in an error.

@item --group=@var{X[,Y,Z]}
@itemx -g @var{X[,Y,Z]}
@opindex --group
@opindex -g
@cindex grouping
Group input via fields @var{X[,Y,Z]}. By default, fields are separated by TABs.
Use @option{--field-separator} to change the delimiter character. Input file
must be sorted by the same fields @var{X[,Y,Z]}. Use @option{--sort}
to automatically sort the input.
If @option{--group} is not specified, each operation is performed
on the entire input file.
Ranges of field numbers like @var{X-Z} are also supported.

@item --header-in
@opindex --header-in
Indicates the first input line is column headers, and should not be used for
any calculations.

@item --header-out
@opindex --header-out
Print column headers as the first line. If the column header names are known
(i.e. the input file had a header line, and @command{datamash} was invoked with
@option{--header-in}, @option{-H} or @option{--headers}), prints the operation
and the name of the field (e.g. @samp{mean(X)}). Otherwise, prints the
operation and the field number (e.g. @samp{mean(field-3)}).

@item --headers
@itemx -H
@opindex --headers
@opindex -H
Same as @samp{--header-in --header-out}. A short option indicating the input
file has a header line, and the output should contain a header line as well.

@item --vnlog
@opindex --vnlog
Enable experimental support for the vnlog data file format for both input
and output.  This format is explained at @url{https://github.com/dkogan/vnlog}.

@item --ignore-case
@itemx -i
@opindex --ignore-case
@opindex -i
Ignore upper/lower case when comparing text for grouping, sorting, and comparing
unique values in the @samp{countunique} and @samp{unique}
(or @samp{uniq}) operations.

@item --sort
@itemx -s
@opindex --sort
@opindex -s
@cindex sorting
Sort the input before grouping. Grouping requires sorted input: if the input
is not sorted, using @option{--sort} will automatically sort it before
processing it further. Sorting is performed based on the specified
@option{--group} parameter, and respects the @option{--ignore-case} option
(if used). The following commands are equivalent:
@example
$ cat FILE | sort -k1,1 | datamash --group 1 sum 1
$ cat FILE | datamash --sort --group 1 sum 1
@end example

@item --sort-cmd=@var{PATH}
@opindex --sort-cmd
@cindex sorting
Use the given program to sort instead of the system @command{sort}.

@end table


@exdent File Operation options:

@table @option

@item --no-strict
@opindex --no-strict
Allow lines with varying number of fields. By default, @option{transpose} and
@option{reverse} will fail with an error message unless all input lines have
the same number of fields.

@item --filler=@var{x}
@opindex --filler
When the @option{--no-strict} option is used, missing fields will be filled
with this value.
@end table

@exdent General options:

@table @option

@item --format=@var{FORMAT}
@opindex --format
Print numeric values with printf-style floating-point @var{FORMAT}.


@item --field-separator=@var{x}
@itemx -t @var{x}
@opindex --field-separator
@opindex -t
Use character @var{x} instead of TAB as input and output field delimiter.
If @option{--output-delimiter} is also used, it will override the output
field delimiter.

@item --narm
@opindex --narm
Skip @var{NA} or @var{NaN} values.

@item --output-delimiter=@var{x}
@opindex --output-delimiter
Use character @var{x} as the output field delimiter.
This option overrides @option{--field-separator}/@option{-t}/
@option{--whitespace}/@option{-W}.

@item --collapse-delimiter=@var{x}
@itemx -c @var{x}
@opindex --collapse-delimiter
@opindex -c

Use character @var{X} instead of comma to delimit items in a
@samp{collapse} or @samp{unique} (aka @samp{uniq}) list.

@item --round=@var{N}
@itemx -R @var{N}
@opindex --round
@opindex -R
Round numeric output to @var{N} decimal places.

@item --whitespace
@itemx -W
@opindex --whitespace
@opindex -W
Use whitespace (one or more spaces and/or tabs) for field delimiters.
Leading whitespace is ignored, trailing whitespace results in an empty field.
TAB character will be used as output field separator.
If @option{--output-delimiter} is also used, it will override the output
field delimiter.

@item --seed
@itemx -S
@opindex --seed
@opindex -S
Select a specific random seed. By default, GNU Datamash uses getrandom(2),
which should be suitable for most purposes. You may wish to force a specific
seed either to draw on a specific entropy source or to ensure the
reproducibility of a specific test.

@item --zero-terminated
@itemx -z
@opindex --zero-terminated
@opindex -z
End lines with a 0 byte, not newline.

@item --help
@itemx -h
@opindex --help
@opindex -h
Print an informative help message on standard output and exit
successfully.

@item --version
@itemx -V
@opindex --version
@opindex -V
Print the version number and licensing information of Datamash on
standard output and then exit successfully.

@end table

@node Available Operations
@chapter Available operations in @command{datamash}

@table @asis
@item Primary operations:
@cindex primary operations
@cindex operations, primary

@table @option
@item groupby
alternative syntax for @option{--group}
@item crosstab
cross-tabulate two fields (also known as 'pivot-tables')
@item transpose
transpose rows, columns of a text file
@item reverse
reverse fields in each line of a text file
@item check
verify tabular structure of input (ensure same number of fields in all lines)
@end table

@item Line-Filtering operation:
@cindex line filtering operation
@cindex operations, line filtering

@table @option
@item rmdup
remove lines with duplicated key value
@end table

@item Per-Line operations:
@cindex Per-Line operations
@cindex operations, per-line

@table @option
@item base64
encode the field as base64
@item debase64
decode the field as base64. Exits with an error if the field contains an
invalid base64 value which cannot be decoded.
@item md5
calculates md5 hash of the field
@item sha1
calculates sha1 hash of the field
@item sha224
calculates sha224 hash of the field
@item sha256
calculates sha256 hash of the field
@item sha384
calculates sha384 hash of the field
@item sha512
calculates sha512 hash of the field
@item dirname
extracts the directory name of the field (assuming the field is a file name).
Similar to @command{dirname(1)}.
@item basename
extracts the base file name of the field (assuming the field is a file name).
Similar to @command{basename(1)}.
@item extname
extracts the extension of the file name of the field (assuming the field is a
file name).
@item barename
extracts the base file name of the field without the extension (assuming the
field is a file name).
@item getnum
extract a number from the field. @code{getnum} accepts an optional single
letter option @samp{n/i/d/p/h/o} affecting the detected value.
@item cut
copy input field to output field (similar to @command{cut(1)}).
When the @code{cut} operation is given a list of fields, the fields are copied
in the given order (in contrast to @command{cut(1)}).
@item echo
an alias for @code{cut}.
@end table

@item Group-by Numeric operations:
@cindex numeric operations
@cindex operations, numeric

@table @option
@item sum
sum of the values
@item min
minimum value
@item max
maximum value
@item absmin
minimum of the absolute values
@item absmax
maximum of the absolute values
@item range
range of values (maximum - minimum)
@end table

@item Group-By Textual/Numeric operations:
@cindex Textual operations
@cindex operations, textual

@table @option
@item count
count number of elements in the group
@item first
the first value of the group
@item last
the last value of the group
@item rand
one random value from the group
@item unique
comma-separated sorted list of unique values
@item uniq
an alias for @code{unique}.

@option{--collapse-delimiter} selects a delimiter character other than
the default comma.

@item collapse
comma-separated list of all input values

@option{--collapse-delimiter} selects a delimiter character other than
the default comma.

@item countunique
number of unique/distinct values
@end table

@item Group-By Statistical operations:
@cindex Statistical operations
@cindex operations, statistical

@table @option
@item mean
mean of the values
@item geomean
geometric mean of the values
@item harmmean
harmonic mean of the values
@item trimmean
trimmed mean of the values
@item ms
mean square of the values
@item rms
root mean square of the values
@item median
median value
@item q1
1st quartile value
@item q3
3rd quartile value
@item iqr
inter-quartile range
@item perc
percentile value
@item mode
mode value (most common value)
@item antimode
anti-mode value (least common value)
@item pstdev
population standard deviation
@item sstdev
sample standard deviation
@item pvar
population variance
@item svar
sample variance
@item mad
Median Absolute Deviation,
scaled by a constant 1.4826 for normal distributions
@item madraw
Median Absolute Deviation, unscaled
@item sskew
skewness of the (sample) group
@item pskew
skewness of the (population) group
@item skurt
Excess Kurtosis of the (sample) group
@item pkurt
Excess Kurtosis of the (population) group
@item jarque
p-value of the Jarque-Bera test for normality
@item dpo
p-value of the D'Agostino-Pearson Omnibus test for normality.
@end table

@end table


@node Statistical Operations
@chapter Statistical Operations

@cindex statistics
@cindex operations, statistical
@cindex statistical operations

@unnumberedsec Equivalent R functions
GNU Datamash is designed to closely follow the R project's
(@url{https://www.r-project.org/}) statistical functions.
See the @file{files/operators.R} file
for the R equivalent code for each of datamash's operators.
When building @command{datamash} from source code on your local computer,
operators are compared to known results of the equivalent R functions.



@node Usage Examples
@chapter Usage Examples
@cindex usage examples
@cindex examples, usage

@menu
* Summary Statistics::		  count,min,max,mean,stdev,median,quartiles
* Header Lines and Column Names:: Using files with header lines
* Field Delimiters::              Tabs, Whitespace, other delimiters
* Column Ranges::                 Operating on multiple columns
* Reverse and Transpose::         swapping and transposing rows, columns
* Groupby on @file{/etc/passwd}:: Groupby, count, collapse
* Check::                         Validate tabular structure
* Crosstab::                      Cross-tabulation (pivot-tables)
* Rounding numbers::              round, ceil, floor, trunc, frac
* Binning numbers::               assigning numbers into fixed number of buckets
* Binning strings::               assigning strings into fixed number of buckets
* Extracting numeric values::     using getnum
@end menu


@node Summary Statistics
@section Summary Statistics
@cindex summary statistics example
@cindex examples, summary statistics

The following are examples of using @command{datamash} to quickly
calculate summary statistics. The examples will use a file with three
fields (name, subject, score) representing grades of students:

@example
$ cat scores.txt
Shawn     Arts  65
Marques   Arts  58
Fernando  Arts  78
Paul      Arts  63
Walter    Arts  75
...
@end example

Counting how many students study each subject (@emph{subject} is the
second field in the input file, thus @option{groupby 2}):

@example
$ datamash --sort groupby 2 @option{count} 2 < scores.txt
Arts            19
Business        11
Engineering     13
Health-Medicine 13
Life-Sciences   12
Social-Sciences 15
@end example

@cindex min, examples
@cindex max, examples
@cindex examples, min
@cindex examples, max
Similarly, find the minimum and maximum score in each subject:

@example
$ datamash --sort groupby 2 @option{min} 3 @option{max} 3 < scores.txt
Arts             46      88
Business         79      94
Engineering      39      99
Health-Medicine  72     100
Life-Sciences    14      91
Social-Sciences  27      90
@end example

@cindex mean, examples
@cindex standard deviation, examples
@cindex examples, mean
@cindex examples, standard deviation
find the mean and (population) standard deviation in each subject:

@example
$ datamash --sort groupby 2 @option{mean} 3 @option{pstdev} 3 < scores.txt
Arts              68.947  10.143
Business          87.363   4.940
Engineering       66.538  19.101
Health-Medicine   90.615   8.862
Life-Sciences     55.333  19.728
Social-Sciences   60.266  16.643
@end example


@cindex median, examples
@cindex examples, median
@cindex quartiles, examples
@cindex examples, quartiles
Find the median, first, third quartiles and the inter-quartile range in
each subject:

@example
$ datamash --sort groupby 2 @option{median} 3 @option{q1} 3 @option{q3} @
3 @option{iqr} 3  < scores.txt
Arts              71      61.5      75.5     14
Business          87      83        92        9
Engineering       56      51        83       32
Health-Medicine   91      84       100       16
Life-Sciences     58.5    44.25     67.75    23.5
Social-Sciences   62      55        70.5     15.5
@end example


@xref{Header Lines and Column Names}, for examples of dealing with
header lines.

@node Header Lines and Column Names
@section Header Lines and Column Names

@opindex --header-out
@cindex examples, header
@cindex examples, header-out
@cindex header, examples
@cindex header-out, examples
@unnumberedsubsec Output Header Lines

If the input does @emph{not} have a header line, use
@option{--header-out} to add a header in the first line of the output,
indicating which operation was performed:

@example
$ datamash --sort @option{--header-out} groupby 2 @option{min} @
3 @option{max} 3 < scores.txt
GroupBy(field-2)  min(field-3)  max(field-3)
Arts              46            88
Business          79            94
Engineering       39            99
Health-Medicine   72           100
Life-Sciences     14            91
Social-Sciences   27            90
@end example


@unnumberedsubsec Skipping Input Header Lines

@opindex --header-in
@cindex examples, header
@cindex examples, header-in
@cindex header-in, examples

If the input has a header line (first line containing column names),
use @option{--header-in} to skip the line:

@example
$ cat scores_h.txt
Name      Major   Score
Shawn     Arts    65
Marques   Arts    58
Fernando  Arts    78
Paul      Arts    63
...


$ datamash --sort @option{--header-in} groupby 2 mean 3 < scores_h.txt
Arts             68.947
Business         87.363
Engineering      66.538
Health-Medicine  90.615
Life-Sciences    55.333
Social-Sciences  60.266
@end example

If the header line is not skipped, @command{datamash} will show an error
(due to strict input validation):

@example
$ datamash groupby 2 mean 3 < scores_h.txt
datamash: invalid numeric value in line 1 field 3: 'Score'
@end example


@unnumberedsubsec Using Header Lines

@opindex --headers
@opindex -H
@cindex examples, headers
@cindex headers, examples

Column names in the input header lines can be printed
in the output header lines by using @option{--headers}
(or @option{-H}, both are equivalent to @option{--header-in --header-out}):

@example
$ datamash --sort @option{--headers} groupby 2 mean 3 < scores_h.txt
GroupBy(Major)    mean(Score)
Arts              68.947
Business          87.363
Engineering       66.538
Health-Medicine   90.615
Life-Sciences     55.333
Social-Sciences   60.266
@end example

Or in short form (@option{-sH} instead of @option{--sort --headers}),
equivalent to the above command:

@example
$ datamash @option{-sH} groupby 2 mean 3
@end example


@unnumberedsubsec Column Names
@cindex column names
@cindex field names

When the input file has a header line, column names can be used
instead of column numbers. In the example below, @var{Major}
is used instead of the value 2, and @var{Score} is used
instead of the value 3:

@example
$ datamash --sort --headers groupby Major mean Score < scores_h.txt
GroupBy(Major)    mean(Score)
Arts              68.947
Business          87.363
Engineering       66.538
Health-Medicine   90.615
Life-Sciences     55.333
Social-Sciences   60.266
@end example

@command{datamash} will read the first line of the input, and deduce
the correct column number based on the given name. If the column name
is not found, an error will be printed:

@example
$ datamash --sort --headers groupby 2 mean @option{Foo}  < scores_h.txt
datamash: column name 'Foo' not found in input file
@end example


Field names must be escaped with a backslash if they start with a digit
or contain special characters (dash/minus, colons, commas).
Note the interplay between escaping with backslash and shell quoting.
The following equivalent commands sum the values of a field named @samp{FOO-BAR}:

@example
$ datamash -H sum FOO\\-BAR < input.txt
$ datamash -H sum 'FOO\-BAR' < input.txt
$ datamash -H sum "FOO\\-BAR" < input.txt
@end example



@node Field Delimiters
@section Field Delimiters
@cindex field delimiters
@cindex whitespace delimiters
@cindex delimiters, whitespace
@cindex tab delimiters
@cindex delimiters, tabs

@command{datamash} uses tabs (ASCII character 0x09) as default field
delimiters.  Use @option{-W} to treat one or more consecutive
whitespace characters as field delimiters. Use @option{-t},
@option{--field-separator} to set a custom field delimiter.

The following examples illustrate the various options.

By default, fields are separated by a single tab. Multiple consecutive tabs
denote multiple fields (this is consistent with GNU coreutils'
@command{cut}):

@example
$ printf '1\t\t2\n' | datamash sum 3
2
$ printf '1\t\t2\n' | cut -f3
2
@end example

Every tab separates two fields.  A line starting with a tab thus starts
with an empty field, and a line ending with a tab ends with an empty field.

Using @option{-W}, one or more consecutive whitespace characters
are treated as a single field delimiter:

@example
$ printf '1  \t  2\n' | datamash -W sum 2
2
$ printf '1  \t  2\n' | datamash -W sum 3
datamash: invalid input: field 3 requested, line 1 has only 2 fields
@end example

With @option{-W}, leading whitespace is ignored, but trailing whitespace
is significant.  A line starting with one or more consecutive whitespace
characters followed by a non-whitespace character starts with a non-empty
field.  A line ending with one or more consecutive whitespace characters
ends with an empty field.
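This counting rule can be emulated with @command{awk} (an illustrative
sketch only, not how datamash is implemented): strip leading
whitespace, then split on runs of whitespace, keeping the empty field
produced by a trailing run:

```shell
# Emulate datamash -W field counting (illustrative approximation):
# leading whitespace is stripped; awk's split() keeps a trailing
# empty field when the line ends with a separator run.
nfields() { awk '{ sub(/^[ \t]+/, ""); print split($0, a, /[ \t]+/) }'; }

printf '  1 \t 2\n' | nfields   # 2: leading whitespace is ignored
printf '1 2  \n'    | nfields   # 3: trailing whitespace adds an empty field
```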

Using @option{-t}, a custom field delimiter character can be specified.
Multiple consecutive delimiters are treated as multiple fields:

@example
$ printf '1,10,,100\n' | datamash -t, sum 4
100
@end example



@node Column Ranges
@section Column Ranges
@cindex column ranges
@cindex ranges, columns
@cindex multiple columns

@command{datamash} accepts column ranges such as @var{1,2,3} and @var{1-3}.


Simulating input with multiple columns:

@example
$ seq 100 | paste - - - -
1    2    3    4
5    6    7    8
9   10   11   12
13  14   15   16
17  18   19   20
...
@end example

The following are equivalent:

@example
$ seq 100 | paste - - - - | datamash sum 1 sum 2 sum 3 sum 4
1225  1250   1275   1300

$ seq 100 | paste - - - - | datamash sum 1,2,3,4
1225  1250   1275   1300

$ seq 100 | paste - - - - | datamash sum 1-4
1225  1250   1275   1300

$ seq 100 | paste - - - - | datamash sum 1-3,4
1225  1250   1275   1300
@end example

Ranges can be used with multiple operations:

@example
$ seq 100 | paste - - - - | datamash sum 1-4 mean 1-4
1225  1250   1275   1300   49   50   51   52
@end example




@node Reverse and Transpose
@section Reverse and Transpose

@unnumberedsubsec Transpose
@cindex transpose
@cindex swap rows, columns

Use @option{transpose} to swap rows and columns in a file:

@example
$ cat input.txt
Sample   Year   Count
A        2014   1002
B        2013    990
C        2014   2030
D        2014    599

$ datamash transpose < input.txt
Sample  A       B       C       D
Year    2014    2013    2014    2014
Count   1002    990     2030    599
@end example


@cindex strict mode
@cindex input validation, transpose
@cindex transpose, input validation
By default, @option{transpose} verifies the input has the same number
of fields in each line, and fails with an error otherwise:

@example
$ cat input1.txt
Sample   Year   Count
A        2014   1002
B        2013
C        2014   2030
D        2014    599


$ datamash transpose < input1.txt
datamash: transpose input error: line 3 has 2 fields (previous lines had 3);
see --help to disable strict mode
@end example

Use @option{--no-strict} to allow missing values:

@opindex --no-strict
@cindex strict, transpose
@cindex transpose, strict
@example
$ datamash --no-strict transpose < input1.txt
Sample  A       B        C        D
Year    2014    2013     2014     2014
Count   1002    N/A      2030     599
@end example

@opindex --filler
@cindex missing values, transpose
@cindex transpose, missing values
@cindex transpose, filler value
Use @option{--filler} to set the missing-field filler value:

@example
$ datamash --no-strict --filler XYZ transpose < input1.txt
Sample  A       B        C        D
Year    2014    2013     2014     2014
Count   1002    XYZ      2030     599
@end example



@unnumberedsubsec Reverse
@cindex reverse columns
@cindex columns, reverse

Use @option{reverse} to reverse the field order in a file:

@example
$ cat input.txt
Sample   Year   Count
A        2014   1002
B        2013    990
C        2014   2030
D        2014    599

$ datamash reverse < input.txt
Count   Year    Sample
1002    2014    A
990     2013    B
2030    2014    C
599     2014    D
@end example

@cindex reverse, strict
@cindex strict, reverse
By default, reverse verifies the input has the same number of fields
in each line, and fails with an error otherwise. Use
@option{--no-strict} to disable this behavior (see section
above for an example).



@unnumberedsubsec Combining Reverse and Transpose

@cindex tac
@cindex reversing lines
@cindex reverse, and transpose
@cindex transpose, and reverse
Reverse and Transpose can be combined to achieve various manipulations.
(reminder: @url{https://www.gnu.org/software/coreutils/tac,tac} can be
used to reverse lines in a file):

@example
$ cat input.txt
A       1       xx
B       2       yy
C       3       zz


$ tac input.txt
C       3       zz
B       2       yy
A       1       xx


$ tac input.txt | datamash reverse
zz      3       C
yy      2       B
xx      1       A


$ cat input.txt | datamash reverse | datamash transpose
xx      yy      zz
1       2       3
A       B       C

$ tac input.txt | datamash reverse | datamash transpose
zz      yy      xx
3       2       1
C       B       A
@end example



@node Groupby on /etc/passwd
@section Groupby on @file{/etc/passwd}
@cindex groupby
@cindex @file{/etc/passwd}, examples
@cindex examples, @file{/etc/passwd}

@command{datamash} with the @option{groupby} operation mode
can be used to aggregate information.

Using this simulated @file{/etc/passwd} file as input:

@example
$ cat passwd
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/usr/sbin/nologin
man:x:6:12:man:/var/cache/man:/usr/sbin/nologin
lp:x:7:7:lp:/var/spool/lpd:/usr/sbin/nologin
mail:x:8:8:mail:/var/mail:/usr/sbin/nologin
news:x:9:9:news:/var/spool/news:/usr/sbin/nologin
uucp:x:10:10:uucp:/var/spool/uucp:/usr/sbin/nologin
proxy:x:13:13:proxy:/bin:/usr/sbin/nologin
www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin
backup:x:34:34:backup:/var/backups:/usr/sbin/nologin
list:x:38:38:Mailing List Manager:/var/list:/usr/sbin/nologin
mysql:x:115:124:MySQL Server,,,:/var/lib/mysql:/bin/false
sshd:x:116:65534::/var/run/sshd:/usr/sbin/nologin
guest:x:118:125:Guest,,,:/tmp/guest-home.phc17z:/bin/bash
gordon:x:1004:1000:Assaf Gordon,,,,:/home/gordon:/bin/bash
charles:x:1005:1000:Charles,,,,:/home/charles:/bin/bash
alice:x:1006:1000:Alice,,,,:/home/alice:/bin/bash
bob:x:1007:1000:Bob,,,,:/home/bob:/bin/bash
postgres:x:119:126:PostgreSQL administrator,,,:/var/lib/postgresql:/bin/bash
rabbitmq:x:125:138:RabbitMQ messaging server,,,:/var/lib/rabbitmq:/bin/false
redis:x:126:140:redis server,,,:/var/lib/redis:/bin/false
postfix:x:127:141::/var/spool/postfix:/bin/false
@end example

@opindex -t
@opindex --field-separator
The @option{-t} option sets the field separator to @var{:}
(instead of the default @var{tab}).

@cindex groupby, and count
@cindex count
@cindex login shell, examples
Aggregate (@option{groupby}) login shells (column 7) and
@option{count} how many users use each:

@example
$ datamash -t: --sort groupby 7 count 7 < passwd
/bin/bash:7
/bin/false:4
/bin/sync:1
/usr/sbin/nologin:14
@end example

@cindex groupby, and collapse
@cindex collapse
Aggregate (@option{groupby}) login shells (column 7) and print
comma-separated list of users (column 1) for each shell
(@option{collapse}):

@example
$ cat passwd | datamash -t: --sort groupby 7 collapse 1
/bin/bash:root,guest,gordon,charles,alice,bob,postgres
/bin/false:mysql,rabbitmq,redis,postfix
/bin/sync:sync
/usr/sbin/nologin:daemon,bin,sys,games,man,lp,mail,news,uucp,proxy@
,www-data,backup,list,sshd
@end example

Aggregate unix-groups (column 4) and print a
comma-separated list of users (column 1) in each group:

@example
$ datamash -t: --sort groupby 4 collapse 1 < passwd
0:root
1:daemon
10:uucp
1000:gordon,charles,alice,bob
12:man
124:mysql
125:guest
126:postgres
13:proxy
138:rabbitmq
140:redis
141:postfix
2:bin
3:sys
33:www-data
34:backup
38:list
60:games
65534:sync,sshd
7:lp
8:mail
9:news
@end example



@node Check
@section Check - checking tabular structure
@cindex check
@cindex checking tabular structure

@command{datamash} @option{check} validates the tabular structure of a
file, ensuring all lines have the same number of
fields. @option{check} is meant to be used in scripting and automation
pipelines, as it terminates with a non-zero exit code if the file is
not well structured, while also printing detailed context information
about the offending lines:

@example
$ cat good.txt
A    1    ww
B    2    xx
C    3    yy
D    4    zz


$ cat bad.txt
A    1    ww
B    2    xx
C    3
D    4    zz


$ datamash check < good.txt && echo ok || echo fail
4 lines, 3 fields
ok


$ datamash check < bad.txt && echo ok || echo fail
line 2 (3 fields):
  B  2 xx
line 3 (2 fields):
  C  3
datamash: check failed: line 3 has 2 fields (previous line had 3)
fail
@end example

@subsection Expected number of lines/fields

@option{check} accepts optional @var{lines} and @var{fields} and will
return failure if the input does not have the requested number of lines/fields.

@exdent The syntax is:

@example
datamash check [@var{N} lines] [@var{N} fields]
@end example

@exdent Usage examples:

@example
$ cat file.txt
A    1    ww
B    2    xx
C    3    yy
D    4    zz

$ datamash check 4 lines < file.txt && echo ok
4 lines, 3 fields
ok

$ datamash check 3 fields < file.txt && echo ok
4 lines, 3 fields
ok

$ datamash check 4 lines 3 fields < file.txt && echo ok
4 lines, 3 fields
ok

$ datamash check 7 fields < file.txt && echo ok
line 1 (3 fields):
  A    1    ww
datamash: check failed: line 1 has 3 fields (expecting 7)

$ datamash check 10 lines < file.txt && echo ok
datamash: check failed: input had 4 lines (expecting 10)
@end example

For convenience, @var{line}, @var{row}, @var{rows}
can be used instead of @var{lines};
@var{field}, @var{columns}, @var{column}, @var{col} can be used
instead of @var{fields}.
The following are all equivalent:

@example
datamash check 4 lines 10 fields < file.txt
datamash check 4 rows  10 columns < file.txt
datamash check 10 col 4 row < file.txt
@end example


@subsection Checks in automation scripts

@cindex fail fast
@cindex shell scripts, check
@cindex check, in automation and shell scripts
In pipeline/automation context, it is often beneficial to validate
files as early as possible (immediately after file is created, as in
@url{https://en.wikipedia.org/wiki/Fail-fast, fail-fast methodology}).
A typical usage in a shell script would be:

@example
@verbatim
#!/bin/sh

die()
{
    base=$(basename "$0")
    echo "$base: error: $@" >&2
    exit 1
}

custom pipeline-or-program > output.txt \
    || die "program failed"

datamash check < output.txt \
    || die "'output.txt' has invalid structure (missing fields)"
@end verbatim
@end example

If the generated @file{output.txt} file has invalid structure
(i.e. missing fields), @command{datamash} will print to @file{stderr}
enough details to help in troubleshooting (line numbers and the
offending line's content).

@node Crosstab
@section Crosstab - Cross-Tabulation (pivot-tables)
@cindex crosstab
@cindex pivot tables
@cindex cross tabulation

Cross-tabulation compares the relationship between two fields.
Given the following input file:

@example
$ cat input.txt
a    x    3
a    y    7
b    x    21
a    x    40
@end example

@opindex count
@cindex count, crosstab and
Show cross-tabulation between the first field (a/b) and the second
field (x/y) - counting how many times each pair appears (note: sorting
is required):

@example
$ datamash -s crosstab 1,2 < input.txt
     x    y
a    2    1
b    1    N/A
@end example

The default operation is @option{count} - in the above example,
@var{a} and @var{x} appear twice in the input file, while @var{b} and @var{y}
never appear together.

An optional grouping operation can be used instead of counting.

@opindex sum
@cindex sum, crosstab and
@cindex crosstab and sum
For each pair, @option{sum} the values in the third column:

@example
$ datamash -s crosstab 1,2 sum 3 < input.txt
     x    y
a    43   7
b    21   N/A
@end example

@opindex unique
@cindex unique, crosstab and
@cindex crosstab and unique
For each pair, list all @option{unique} values in the third column:

@example
$ datamash -s crosstab 1,2 unique 3 < input.txt
     x    y
a    3,40 7
b    21   N/A
@end example

@opindex --header-out
@cindex --header-out, crosstab and
@cindex crosstab and --header-out
Note that using @option{--header-out} with crosstab prints a line showing
how to interpret the rows and columns, and what operation was used.

@example
$ datamash -s --header-in --header-out crosstab 1,2 < input.txt
GroupBy(a) GroupBy(x) count(a)
     x    y
a    1    1
b    1    N/A
@end example

@node Rounding numbers
@section Rounding numbers

@cindex rounding numbers
@opindex round
@opindex ceil
@opindex floor
@opindex trunc
@opindex frac

The following example demonstrates the different rounding operations:

@example
$ ( echo X ; seq -1.25 0.25 1.25 ) \
      | datamash --full -H round 1 ceil 1 floor 1 trunc 1 frac 1

  X     round(X)  ceil(X)  floor(X)  trunc(X)   frac(X)
-1.25   -1        -1       -2        -1         -0.25
-1.00   -1        -1       -1        -1          0
-0.75   -1         0       -1         0         -0.75
-0.50   -1         0       -1         0         -0.5
-0.25    0         0       -1         0         -0.25
 0.00    0         0        0         0          0
 0.25    0         1        0         0          0.25
 0.50    1         1        0         0          0.5
 0.75    1         1        0         0          0.75
 1.00    1         1        1         1          0
 1.25    1         2        1         1          0.25
@end example


@node Binning numbers
@section Binning numbers
@opindex bin
@cindex buckets, binning numbers
@cindex binning numbers

Bin input values into buckets of size 5:

@example
$ ( echo X ; seq -10 2.5 10 ) \
      | datamash -H --full bin:5 1
    X  bin(X)
-10.0    -10
 -7.5    -10
 -5.0     -5
 -2.5     -5
  0.0      0
  2.5      0
  5.0      5
  7.5      5
 10.0     10
@end example
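The bucket for each value is floor(X/size) * size. This can be
emulated in @command{awk} (an illustrative sketch; datamash implements
binning internally):

```shell
# Emulate bin:5 (illustrative): bucket = floor(X/5)*5.
# awk's int() truncates toward zero, so adjust for negative values.
bin5() { awk -v w=5 '{ b = int($1/w); if ($1 < 0 && b*w != $1) b--; print b*w }'; }

echo '-7.5' | bin5    # -10
echo '-5'   | bin5    # -5
echo '7.5'  | bin5    # 5
```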

@node Binning strings
@section Binning strings
@opindex strbin
@cindex buckets, binning strings
@cindex binning strings

The @option{strbin} operation hashes any input string into a numeric integer.
A typical usage would be to split an input file
into @var{N} chunks, ensuring that all values of a certain key will
be stored in the same chunk:

@example
$ cat input.txt
PatientA   10
PatientB   11
PatientC   12
PatientA   14
PatientC   15
@end example

Each patient ID is hashed into a bin between 0 and 9
and printed in the last field:

@example
$ datamash --full strbin 1 < input.txt
PatientA   10    5
PatientB   11    6
PatientC   12    7
PatientA   14    5
PatientC   15    7
@end example

Splitting the input into chunks can be done with awk:

@example
@verbatim
$ cat input.txt | datamash --full strbin 1 \
    | awk '{print > $NF ".txt"}'
@end verbatim
@end example


@node Extracting numeric values
@section Extracting numeric values - using getnum
@opindex getnum
@cindex numbers, extracting from a field

The @code{getnum} operation extracts a numeric value from the field:

@example
@verbatim
$ echo zoom-123.45xyz | datamash getnum 1
123.45
@end verbatim
@end example


@code{getnum} accepts an optional single-letter @var{TYPE} option:

@table @option
@item getnum:n
natural numbers (positive integers, including zero)
@item getnum:i
integers
@item getnum:d
decimal point numbers
@item getnum:p
positive decimal point numbers (this is the default)
@item getnum:h
hex numbers
@item getnum:o
octal numbers
@end table

Examples:

@example
@verbatim
$ echo zoom-123.45xyz | datamash getnum 1
123.45

$ echo zoom-123.45xyz | datamash getnum:n 1
123

$ echo zoom-123.45xyz | datamash getnum:i 1
-123

$ echo zoom-123.45xyz | datamash getnum:d 1
-123.45

$ echo zoom-123.45xyz | datamash getnum:p 1
123.45

# Hex 0x123 = 291 Decimal
$ echo zoom-123.45xyz | datamash getnum:h 1
291

# Octal 0123 = 83 Decimal
$ echo zoom-123.45xyz | datamash getnum:o 1
83
@end verbatim
@end example
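The @var{TYPE} extractions can be approximated with @command{grep}
regular expressions, following the descriptions in the table above (an
illustrative sketch, not datamash's actual parser):

```shell
s='zoom-123.45xyz'
echo "$s" | grep -oE '[0-9]+' | head -1                  # natural: 123
echo "$s" | grep -oE -- '-?[0-9]+' | head -1             # integer: -123
echo "$s" | grep -oE -- '-?[0-9]+(\.[0-9]+)?' | head -1  # decimal: -123.45
echo "$s" | grep -oE '[0-9]+(\.[0-9]+)?' | head -1       # positive decimal: 123.45

# hex digits, converted to decimal (0x123 = 291)
printf '%d\n' "0x$(echo "$s" | grep -oE '[0-9a-fA-F]+' | head -1)"
```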




@node Reporting bugs
@chapter Reporting bugs

@cindex bug reporting
@cindex problems
@cindex reporting bugs

To report bugs, suggest enhancements or otherwise discuss GNU Datamash,
please send electronic mail to @email{bug-datamash@@gnu.org}.

@cindex checklist for bug reports
For bug reports, please include enough information for the maintainers
to reproduce the problem.  Generally speaking, that means:

@itemize @bullet
@item The version numbers of Datamash (which you can find by running
      @w{@samp{datamash --version}}) and any other program(s) or
      manual(s) involved.
@item Hardware and operating system names and versions.
@item The contents of any input files necessary to reproduce the bug.
@item The expected behavior and/or output.
@item A description of the problem and samples of any erroneous output.
@item Options you gave to @command{configure} other than specifying
      installation directories.
@item Anything else that you think would be helpful.
@end itemize

When in doubt whether something is needed or not, include it.  It's
better to include too much than to leave out something important.

@cindex patches, contributing
Patches are welcome; if possible, please make them with @samp{@w{diff
-u}} (@pxref{Top,, Overview, diff, Comparing and Merging Files}) and
include @file{ChangeLog} entries (@pxref{Change Log,,, emacs, The GNU
Emacs Manual}).  Please follow the existing coding style.


@node GNU Free Documentation License
@appendix GNU Free Documentation License

@include fdl.texi


@node Concept index
@unnumbered Concept index

@printindex cp

@bye