File: parse.t

Integration tests for augur parse.

  $ pushd "$TESTDIR" > /dev/null
  $ export AUGUR="${AUGUR:-../../bin/augur}"

Try to parse Zika sequences without specifying fields.
This should fail.

  $ ${AUGUR} parse \
  >   --sequences parse/zika.fasta \
  >   --output-sequences "$TMP/sequences.fasta" \
  >   --output-metadata "$TMP/metadata.tsv"
  usage: .* (re)
  .* (re)
  .* (re)
  .* (re)
  .* (re)
  .* (re)
  augur parse: error: the following arguments are required: --fields
  [2]

Parse Zika sequences into sequences and metadata.

  $ ${AUGUR} parse \
  >   --sequences parse/zika.fasta \
  >   --output-sequences "$TMP/sequences.fasta" \
  >   --output-metadata "$TMP/metadata.tsv" \
  >   --fields strain virus accession date region country division city db segment authors url title journal paper_url \
  >   --prettify-fields region country division city \
  >   --fix-dates monthfirst

  $ diff -u "parse/sequences.fasta" "$TMP/sequences.fasta"
  $ diff -u "parse/metadata.tsv" "$TMP/metadata.tsv"
  $ rm -f "$TMP/sequences.fasta" "$TMP/metadata.tsv"
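
For orientation, the effect of --fix-dates monthfirst on an unambiguous numeric date can be approximated in one line of shell.
This is only a sketch, assuming that the date is read month first and written back as YYYY-MM-DD.
monthfirst_to_iso is a hypothetical helper, not an augur function, and it handles only the MM/DD/YYYY form.

```shell
# Hypothetical helper, not part of augur: reorder MM/DD/YYYY into YYYY-MM-DD.
# awk reads the zero-padded fields as decimal numbers, so "08" and "09" are safe.
monthfirst_to_iso() { printf '%s\n' "$1" | awk -F/ '{ printf "%04d-%02d-%02d\n", $3, $1, $2 }'; }

monthfirst_to_iso "05/01/2020"   # prints 2020-05-01
```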

Parse Zika sequences into sequences and metadata, using a different metadata field (e.g. accession) as the record id.

  $ ${AUGUR} parse \
  >   --sequences parse/zika.fasta \
  >   --output-sequences "$TMP/sequences.fasta" \
  >   --output-metadata "$TMP/metadata.tsv" \
  >   --output-id-field accession \
  >   --fields strain virus accession date region country division city db segment authors url title journal paper_url \
  >   --prettify-fields region country division city \
  >   --fix-dates monthfirst

  $ diff -u "parse/sequences_other.fasta" "$TMP/sequences.fasta"
  $ diff -u "parse/metadata.tsv" "$TMP/metadata.tsv"
  $ rm -f "$TMP/sequences.fasta" "$TMP/metadata.tsv"

Try to parse Zika sequences with an --output-id-field that is not among the given fields.
This should fail.

  $ ${AUGUR} parse \
  >   --sequences parse/zika.fasta \
  >   --output-sequences "$TMP/sequences.fasta" \
  >   --output-metadata "$TMP/metadata.tsv" \
  >   --output-id-field notexist \
  >   --fields strain virus accession date region country division city db segment authors url title journal paper_url \
  >   --prettify-fields region country division city \
  >   --fix-dates monthfirst
  ERROR: Output id field 'notexist' not found in fields ['strain', 'virus', 'accession', 'date', 'region', 'country', 'division', 'city', 'db', 'segment', 'authors', 'url', 'title', 'journal', 'paper_url'].
  [2]

Parse Zika sequences into sequences and metadata. The preferred default id field is 'name', then 'strain', then the first field.

  $ ${AUGUR} parse \
  >   --sequences parse/zika.fasta \
  >   --output-sequences "$TMP/sequences.fasta" \
  >   --output-metadata "$TMP/metadata.tsv" \
  >   --fields strain virus name date region country division city db segment authors url title journal paper_url \
  >   --prettify-fields region country division city \
  >   --fix-dates monthfirst
  DEPRECATED: The default search order for the ID field will be changing from ('name', 'strain') to ('strain', 'name').
  Users who prefer to keep using 'name' instead of 'strain' should use the parameter: --output-id-field 'name'

  $ diff -u "parse/sequences_other.fasta" "$TMP/sequences.fasta"
  $ rm -f "$TMP/sequences.fasta" "$TMP/metadata.tsv"

Parse Zika sequences into sequences and metadata when there is no 'name' field.
This should use the 2nd entry in DEFAULT_ID_COLUMNS ('name', 'strain') instead.

  $ ${AUGUR} parse \
  >   --sequences parse/zika.fasta \
  >   --output-sequences "$TMP/sequences.fasta" \
  >   --output-metadata "$TMP/metadata.tsv" \
  >   --fields col1 virus strain date region country division city db segment authors url title journal paper_url \
  >   --prettify-fields region country division city \
  >   --fix-dates monthfirst

  $ diff -u "parse/sequences_other.fasta" "$TMP/sequences.fasta"
  $ rm -f "$TMP/sequences.fasta" "$TMP/metadata.tsv"

Parse Zika sequences into sequences and metadata when no --output-id-field is provided and none of the fields match DEFAULT_ID_COLUMNS (e.g. ('strain', 'name')).
This should use the first field as the id field and the metadata should not have an extra strain or name column.

  $ ${AUGUR} parse \
  >   --sequences parse/zika.fasta \
  >   --output-sequences "$TMP/sequences.fasta" \
  >   --output-metadata "$TMP/metadata.tsv" \
  >   --fields col1 virus col3 date region country division city db segment authors url title journal paper_url \
  >   --prettify-fields region country division city \
  >   --fix-dates monthfirst

  $ diff -u "parse/sequences.fasta" "$TMP/sequences.fasta"
  $ diff "parse/metadata.tsv" "$TMP/metadata.tsv" | tr '>' '+' | tr '<' '-'
  1c1
  - strain\tvirus\taccession\tdate\tregion\tcountry\tdivision\tcity\tdb\tsegment\tauthors\turl\ttitle\tjournal\tpaper_url (esc)
  ---
  + col1\tvirus\tcol3\tdate\tregion\tcountry\tdivision\tcity\tdb\tsegment\tauthors\turl\ttitle\tjournal\tpaper_url (esc)
  $ rm -f "$TMP/sequences.fasta" "$TMP/metadata.tsv"
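
The fallback order exercised by the three tests above ('name', then 'strain', then the first field) can be sketched as a small shell function.
pick_id_field is a hypothetical name used only for illustration; it is not augur's implementation.

```shell
# Hypothetical helper, not part of augur.
# Prints the id field chosen from the given field names:
# 'name' if present, else 'strain', else the first field.
# (Left unindented on purpose: cram treats indented lines as test content.)
pick_id_field() {
case " $* " in
*" name "*) echo name ;;
*" strain "*) echo strain ;;
*) echo "$1" ;;
esac
}

pick_id_field strain virus name date    # prints name
pick_id_field col1 virus strain date    # prints strain
pick_id_field col1 virus col3 date      # prints col1
```

Padding the joined argument list with spaces keeps a field such as 'strain_name' from matching 'name' by accident.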

Parse compressed Zika sequences into sequences and metadata.

  $ ${AUGUR} parse \
  >   --sequences parse/zika.fasta.gz \
  >   --output-sequences "$TMP/sequences.fasta" \
  >   --output-metadata "$TMP/metadata.tsv" \
  >   --fields strain virus accession date region country division city db segment authors url title journal paper_url \
  >   --prettify-fields region country division city \
  >   --fix-dates monthfirst

  $ diff -u "parse/sequences.fasta" "$TMP/sequences.fasta"
  $ diff -u "parse/metadata.tsv" "$TMP/metadata.tsv"
  $ rm -f "$TMP/sequences.fasta" "$TMP/metadata.tsv"

Error on the first duplicate.

  $ cat >"$TMP/data.fasta" <<~~
  > >SEQ1
  > AAA
  > >SEQ1
  > AAA
  > >SEQ2
  > AAA
  > >SEQ2
  > AAA
  > ~~
  $ ${AUGUR} parse \
  >   --sequences "$TMP/data.fasta" \
  >   --output-sequences "$TMP/sequences.fasta" \
  >   --output-metadata "$TMP/metadata.tsv" \
  >   --fields strain
  ERROR: Duplicate found for 'SEQ1'.
  [2]
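
The "first duplicate" behaviour checked above can be illustrated with a short awk filter.
This is a sketch only: first_duplicate_id is a hypothetical helper, and it treats the whole header line as the record id, which matches the test above only because --fields contains a single field.

```shell
# Hypothetical helper, not part of augur: read FASTA on stdin and print the
# first record id that appears a second time, then stop reading.
first_duplicate_id() { awk '/^>/ { id = substr($0, 2); if (seen[id]++) { print id; exit } }'; }

printf '>SEQ1\nAAA\n>SEQ1\nAAA\n>SEQ2\nAAA\n>SEQ2\nAAA\n' | first_duplicate_id   # prints SEQ1
```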

Run without --fix-dates. The date is left unchanged.

  $ cat >"$TMP/data.fasta" <<~~
  > >SEQ1|05/01/2020
  > AAA
  > ~~
  $ ${AUGUR} parse \
  >   --sequences "$TMP/data.fasta" \
  >   --output-sequences "$TMP/sequences.fasta" \
  >   --output-metadata "$TMP/metadata.tsv" \
  >   --fields strain date

  $ cat "$TMP/metadata.tsv"
  strain	date
  SEQ1	05/01/2020
  $ rm -f "$TMP/metadata.tsv"
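
The test above relies on each FASTA header being split on '|' into the names passed to --fields.
A rough shell equivalent of that split, assuming '|' as the separator (as the header in this test suggests) and using the hypothetical helper name fasta_headers_to_tsv:

```shell
# Hypothetical helper, not part of augur: read FASTA on stdin and print each
# header as one tab-separated row, with the leading '>' removed.
fasta_headers_to_tsv() { awk -F'|' -v OFS='\t' '/^>/ { sub(/^>/, ""); $1 = $1; print }'; }

printf '>SEQ1|05/01/2020\nAAA\n' | fasta_headers_to_tsv   # prints SEQ1<TAB>05/01/2020
```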

  $ popd > /dev/null