# GNU Datamash - Usage Examples

This directory contains sample files demonstrating typical usage of the `datamash`
program.

`datamash` is a command-line program that performs basic numeric, textual and statistical operations on columnar input files.

## Synopsis

`datamash` reads input from STDIN, and performs operations (e.g. sum, mean, count) on
specified columns:

    datamash [OPTIONS] op1 column1 [op2 column2]...

**op1** is the operation to perform, such as: count, sum, min, max, absmin,
absmax, mean, median, mode, antimode, pstdev, sstdev, pvar, svar
(run `datamash --help` for the full list).

**column1** is the column (in the input file) to use for **op1**.

**OPTIONS** are possible command-line options which affect the behaviour of `datamash`.


Example: sum the even numbers between 0 and 100, then count them:

    $ seq 0 2 100 | datamash sum 1
    2550
    $ seq 0 2 100 | datamash count 1
    51
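
As a cross-check that does not depend on `datamash`, the same two numbers can be reproduced with plain `awk`. This is only a sketch for comparison; the function name `sum_and_count` is made up for this example.

```shell
# Hypothetical helper: sum and count the values in column 1 of standard input.
sum_and_count() {
    awk '{ sum += $1; n++ } END { print sum, n }'
}

# Prints the sum followed by the count: 2550 51
seq 0 2 100 | sum_and_count
```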


## Example: Test Scores

The file `scores.txt` contains test scores of college students of different majors
(Arts, Business, Health and Medicine, Life Sciences, Engineering, Social Sciences).

The file has three columns: Name, Major, Score:

    $ cat scores.txt
    Shawn     Arts  65
    Marques   Arts  58
    Fernando  Arts  78
    Paul      Arts  63
    Walter    Arts  75
    ...

Using `datamash`, find the lowest (min) and highest (max) score for each college major
(the major is in column 2, the scores are in column 3):

    $ datamash -g 2 min 3 max 3 < scores.txt
    Arts            46  88
    Business        79  94
    Health-Medicine 72  100
    Social-Sciences 27  90
    Life-Sciences   14  91
    Engineering     39  99
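
To make the grouping step concrete, here is a rough `awk` sketch of what `-g 2 min 3 max 3` computes. It is not how `datamash` is implemented. The function name `group_min_max` and the two Business rows in the sample input are made up for illustration; the Arts rows are the five shown at the top of this section. Like `datamash` without `-s`, the sketch assumes that the rows of each group are adjacent.

```shell
# Hypothetical helper: per-group minimum and maximum of column 3,
# grouped by column 2. Rows of one group must be adjacent in the input.
group_min_max() {
    awk '
        NR == 1 || $2 != group {
            if (NR > 1) print group, min, max   # flush the previous group
            group = $2; min = $3 + 0; max = $3 + 0
            next
        }
        {
            if ($3 + 0 < min) min = $3 + 0
            if ($3 + 0 > max) max = $3 + 0
        }
        END { if (NR > 0) print group, min, max }   # flush the last group
    '
}

# Five Arts rows from scores.txt plus two invented Business rows.
printf '%s\n' \
    'Shawn Arts 65' 'Marques Arts 58' 'Fernando Arts 78' \
    'Paul Arts 63' 'Walter Arts 75' \
    'Alice Business 90' 'Bob Business 82' | group_min_max
```

On this small sample the sketch prints `Arts 58 78` and `Business 82 90`, which differs from the full-file output above because only a few rows are used.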

Similarly, find the mean score and sample standard deviation for each college major:

    $ datamash -g 2 mean 3 sstdev 3 < scores.txt
    Arts             68.9474  10.4215
    Business         87.3636  5.18214
    Health-Medicine  90.6154  9.22441
    Social-Sciences  60.2667  17.2273
    Life-Sciences    55.3333  20.606
    Engineering      66.5385  19.8814
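
The `sstdev` and `pstdev` operations differ only in the divisor: the sample standard deviation divides the sum of squared deviations by n-1, the population standard deviation by n. The following `awk` sketch computes both for a single column so the difference is visible. The function name `stdev_both` is made up for this example, and the input is just the five Arts scores listed at the top of this section, so the result does not match the full-file value.

```shell
# Hypothetical helper: sample and population standard deviation of column 1.
stdev_both() {
    awk '
        { x[NR] = $1; sum += $1 }
        END {
            if (NR < 2) exit 1                  # sstdev needs at least two values
            mean = sum / NR
            for (i = 1; i <= NR; i++) ss += (x[i] - mean) ^ 2
            printf "sstdev=%.2f pstdev=%.2f\n", sqrt(ss / (NR - 1)), sqrt(ss / NR)
        }
    '
}

# Prints: sstdev=8.41 pstdev=7.52
printf '%s\n' 65 58 78 63 75 | stdev_both
```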


## Example: Header Lines

A *header line* is an optional first line in the input or output file that labels each column.
`datamash` can generate a header line in the output even if the input has none (`scores.txt` has no header line; its first line contains data).

Use `--header-out` to add a header line to the output when the input does not contain one:

    $ datamash --header-out -g 2 mean 3 sstdev 3 < scores.txt
    GroupBy(field-2)    mean(field-3)  sstdev(field-3)
    Arts                68.9474        10.4215
    Business            87.3636        5.18214
    Health-Medicine     90.6154        9.22441
    Social-Sciences     60.2667        17.2273
    Life-Sciences       55.3333        20.606
    Engineering         66.5385        19.8814


When the input file has a header line, `datamash` uses the label of each column in the output header line. `scores_h.txt` contains the same information as `scores.txt`, with an additional header line:

    $ cat scores_h.txt
    Name        Major   Score
    Shawn       Arts    65
    Marques     Arts    58
    Fernando    Arts    78
    Paul        Arts    63
    Walter      Arts    75
    ...


Use `-H` (equivalent to `--header-in --header-out`) to use input headers and print output headers:

    $ datamash -H -g 2 mean 3 sstdev 3 < scores_h.txt
    GroupBy(Major)      mean(Score)    sstdev(Score)
    Arts                68.9474        10.4215
    Business            87.3636        5.18214
    Health-Medicine     90.6154        9.22441
    Social-Sciences     60.2667        17.2273
    Life-Sciences       55.3333        20.606
    Engineering         66.5385        19.8814


## Example: Human Genes

**NOTE:** The following examples assume some biology background knowledge.

The `genes.txt` file contains a small subset of the Human Genome genes.
The full dataset is available at the [UCSC Genome Browser](http://hgdownload.cse.ucsc.edu/downloads.html)'s
[Human Genome Database](http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/) as
[refGene.txt.gz](http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz).

The columns of `genes.txt` are:

1. bin
2. name - isoform/transcript identifier
3. chromosome
4. strand
5. txStart - transcription start site
6. txEnd - transcription end site
7. cdsStart - coding start site
8. cdsEnd - coding end site
9. exonCount - number of exons
10. exonStarts
11. exonEnds
12. score
13. GeneName - gene identifier
14. cdsStartStat
15. cdsEndStat
16. exonFrames

### Number of isoforms per gene

The gene identifiers are in column 13, the transcript identifiers are in column 2.
To count how many isoforms each gene has, use `datamash` to group by column 13, and for each group, count the values in column 2 (use `-s` to automatically sort the input file):

    $ datamash -s -g 13 count 2 < genes.txt
    ABCC1   1
    ABCC10  2
    ABCC11  3
    ABCC12  1
    ABCC13  2
    ...

Using the `collapse` operation, we can print all the isoforms for each gene:

    $ datamash -s -g 13 count 2 collapse 2 < genes.txt
    ABCC1   1  NM_004996
    ABCC10  2  NM_001198934,NM_033450
    ABCC11  3  NM_032583,NM_033151,NM_145186
    ABCC12  1  NM_033226
    ABCC13  2  NR_003087,NR_003088
    ...
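
For readers more familiar with `awk`, the combination of `count` and `collapse` can be approximated as below. This is only a sketch under simplifying assumptions: the function name `count_and_collapse` is made up, and the input is reduced to two columns (gene name, transcript identifier) instead of the 16 columns of `genes.txt`. The sample rows are taken from the output above.

```shell
# Hypothetical helper: for each key in column 1, print the key, the number
# of values seen in column 2, and those values joined by commas (input order).
count_and_collapse() {
    awk '
        {
            if ($1 in n) list[$1] = list[$1] "," $2
            else         list[$1] = $2
            n[$1]++
        }
        END { for (k in n) print k, n[k], list[k] }
    ' | LC_ALL=C sort -k1,1      # awk array order is unspecified, so sort by key
}

printf '%s\n' \
    'ABCC10 NM_001198934' \
    'ABCC1 NM_004996' \
    'ABCC10 NM_033450' | count_and_collapse
```

The sketch prints `ABCC1 1 NM_004996` followed by `ABCC10 2 NM_001198934,NM_033450`. Unlike `datamash -s`, it does not need sorted input because it keeps every group in memory until the end.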


### Combining datamash with other programs

Combining `datamash` with additional filtering programs (such as `awk`), we can find relevant information, such as:

Which genes have more than 5 isoforms?

    $ cat genes.txt | datamash -s -g 13 count 2 collapse 2 | awk '$2>5'
    AC159540.1  6  NR_040097,NR_103732,NR_103733,NR_040097,NR_103732,NR_103733
    ACSF3       6  NM_001127214,NM_001243279,NM_001284316,NM_174917,NR_045667,NR_104293
    ADAM29      7  NM_001130703,NM_001130704,NM_001130705,NM_001278125,NM_001278126,NM_001278127,NM_014269
    AIPL1       8  NM_001033054,NM_001033055,NM_001285399,NM_001285400,NM_001285401,NM_001285402,NM_001285403,NM_014336
    ANXA8       6  NM_001040084,NM_001271702,NM_001271703,NM_001040084,NM_001271702,NM_001271703
    ...


Using `datamash` we can quickly explore the dataset and answer simple questions, such as:

Which genes are transcribed from both strands (that is, they have isoforms on both the positive and negative strands;
the strand is in column 4)?

    $ cat genes.txt | datamash -s -g 13 countunique 4 | awk '$2>1'
    AC159540.1   2
    AMY1C        2
    ANXA8        2
    BMS1P17      2
    BMS1P18      2
    ...

Which genes are transcribed from multiple chromosomes (that is, they have isoforms on more than one chromosome;
the chromosome is in column 3)?

    $ cat genes.txt | datamash -s -g 13 countunique 3 unique 3 | awk '$2>1'
    AKAP17A      2   chrX,chrY
    ASMT         2   chrX,chrY
    ASMTL        2   chrX,chrY
    ASMTL-AS1    2   chrX,chrY
    BMS1P17      2   chr14,chr22
    ...


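Explore exon-count variability: for each gene, list the number of isoforms and the minimum, maximum, mean and
population standard deviation of the exon count of its isoforms (the exon count is in column 9).
The `awk` filter keeps only genes with more than one isoform: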

    $ cat genes.txt | datamash -s -g 13 count 9 min 9 max 9 mean 9 pstdev 9 | awk '$2>1'
    ABCC10     2   20   22     21   1
    ABCC11     3   29   30   29.3   0.471405
    ABCC13     2    5    6    5.5   0.5
    ABCC3      2   12   31   21.5   9.5
    AC159540.1 6    4    5    4.1   0.372678
    ...

### Grouping multiple fields

Chromosome name is in column 3. How many transcripts are in each chromosome?

    $ datamash -s -g 3 count 2 < genes.txt
    chr1  365
    chr10 164
    chr11 189
    chr12 187
    chr13 66
    ...

Strand information is in column 4. How many transcripts are in each chromosome AND strand?

    $ datamash -s -g 3,4 count 2 < genes.txt
    chr1  - 183
    chr1  + 182
    chr10 -  52
    chr10 + 112
    chr11 - 105
    chr11 +  84
    chr12 - 117
    chr12 +  70
    ...


## More Information

For more information about `datamash` usage, run `datamash --help` or visit the [datamash website](https://www.gnu.org/software/datamash).