Integration tests for augur parse.
$ pushd "$TESTDIR" > /dev/null
$ export AUGUR="${AUGUR:-../../bin/augur}"
Try to parse Zika sequences without specifying fields.
This should fail.
$ ${AUGUR} parse \
> --sequences parse/zika.fasta \
> --output-sequences "$TMP/sequences.fasta" \
> --output-metadata "$TMP/metadata.tsv"
usage: .* (re)
.* (re)
.* (re)
.* (re)
.* (re)
.* (re)
augur parse: error: the following arguments are required: --fields
[2]
Parse Zika sequences into sequences and metadata.
$ ${AUGUR} parse \
> --sequences parse/zika.fasta \
> --output-sequences "$TMP/sequences.fasta" \
> --output-metadata "$TMP/metadata.tsv" \
> --fields strain virus accession date region country division city db segment authors url title journal paper_url \
> --prettify-fields region country division city \
> --fix-dates monthfirst
$ diff -u "parse/sequences.fasta" "$TMP/sequences.fasta"
$ diff -u "parse/metadata.tsv" "$TMP/metadata.tsv"
$ rm -f "$TMP/sequences.fasta" "$TMP/metadata.tsv"
Parse Zika sequences into sequences and metadata, using a different metadata field as the record id (e.g. accession).
$ ${AUGUR} parse \
> --sequences parse/zika.fasta \
> --output-sequences "$TMP/sequences.fasta" \
> --output-metadata "$TMP/metadata.tsv" \
> --output-id-field accession \
> --fields strain virus accession date region country division city db segment authors url title journal paper_url \
> --prettify-fields region country division city \
> --fix-dates monthfirst
$ diff -u "parse/sequences_other.fasta" "$TMP/sequences.fasta"
$ diff -u "parse/metadata.tsv" "$TMP/metadata.tsv"
$ rm -f "$TMP/sequences.fasta" "$TMP/metadata.tsv"
Try to parse Zika sequences with an output id field that is not among the listed fields.
This should fail.
$ ${AUGUR} parse \
> --sequences parse/zika.fasta \
> --output-sequences "$TMP/sequences.fasta" \
> --output-metadata "$TMP/metadata.tsv" \
> --output-id-field notexist \
> --fields strain virus accession date region country division city db segment authors url title journal paper_url \
> --prettify-fields region country division city \
> --fix-dates monthfirst
ERROR: Output id field 'notexist' not found in fields ['strain', 'virus', 'accession', 'date', 'region', 'country', 'division', 'city', 'db', 'segment', 'authors', 'url', 'title', 'journal', 'paper_url'].
[2]
Parse Zika sequences into sequences and metadata. The preferred default id field is 'name', then 'strain', then the first field.
$ ${AUGUR} parse \
> --sequences parse/zika.fasta \
> --output-sequences "$TMP/sequences.fasta" \
> --output-metadata "$TMP/metadata.tsv" \
> --fields strain virus name date region country division city db segment authors url title journal paper_url \
> --prettify-fields region country division city \
> --fix-dates monthfirst
DEPRECATED: The default search order for the ID field will be changing from ('name', 'strain') to ('strain', 'name').
Users who prefer to keep using 'name' instead of 'strain' should use the parameter: --output-id-field 'name'
$ diff -u "parse/sequences_other.fasta" "$TMP/sequences.fasta"
$ rm -f "$TMP/sequences.fasta" "$TMP/metadata.tsv"
Parse Zika sequences into sequences and metadata when there is no 'name' field.
This should use the 2nd entry in DEFAULT_ID_COLUMNS ('name', 'strain') instead.
$ ${AUGUR} parse \
> --sequences parse/zika.fasta \
> --output-sequences "$TMP/sequences.fasta" \
> --output-metadata "$TMP/metadata.tsv" \
> --fields col1 virus strain date region country division city db segment authors url title journal paper_url \
> --prettify-fields region country division city \
> --fix-dates monthfirst
$ diff -u "parse/sequences_other.fasta" "$TMP/sequences.fasta"
$ rm -f "$TMP/sequences.fasta" "$TMP/metadata.tsv"
Parse Zika sequences into sequences and metadata when no --output-id-field is provided and none of the fields match DEFAULT_ID_COLUMNS (i.e. ('name', 'strain')).
This should use the first field as the id field and the metadata should not have an extra strain or name column.
$ ${AUGUR} parse \
> --sequences parse/zika.fasta \
> --output-sequences "$TMP/sequences.fasta" \
> --output-metadata "$TMP/metadata.tsv" \
> --fields col1 virus col3 date region country division city db segment authors url title journal paper_url \
> --prettify-fields region country division city \
> --fix-dates monthfirst
$ diff -u "parse/sequences.fasta" "$TMP/sequences.fasta"
$ diff "parse/metadata.tsv" "$TMP/metadata.tsv" | tr '>' '+' | tr '<' '-'
1c1
- strain\tvirus\taccession\tdate\tregion\tcountry\tdivision\tcity\tdb\tsegment\tauthors\turl\ttitle\tjournal\tpaper_url (esc)
---
+ col1\tvirus\tcol3\tdate\tregion\tcountry\tdivision\tcity\tdb\tsegment\tauthors\turl\ttitle\tjournal\tpaper_url (esc)
$ rm -f "$TMP/sequences.fasta" "$TMP/metadata.tsv"
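The id-field tests above imply a lookup order: an explicit --output-id-field if given, otherwise 'name', then 'strain', then the first field. The sketch below restates that order as a plain shell function. `pick_id_field` is a hypothetical helper written for illustration, not part of augur, and its error text is the sketch's own.

```shell
# Hypothetical helper (not augur code): choose the id field the way the tests above expect.
# $1 is the requested id field ("" if none); the remaining arguments are the field names.
pick_id_field() {
  requested="$1"
  shift
  if [ -n "$requested" ]; then
    # An explicit request must match one of the listed fields.
    for field in "$@"; do
      if [ "$field" = "$requested" ]; then
        echo "$field"
        return 0
      fi
    done
    echo "ERROR: Output id field '$requested' not found in fields." >&2
    return 2
  fi
  # Assumed default search order, taken from the tests: 'name' first, then 'strain'.
  for candidate in name strain; do
    for field in "$@"; do
      if [ "$field" = "$candidate" ]; then
        echo "$field"
        return 0
      fi
    done
  done
  # No default matched: fall back to the first field.
  echo "$1"
}
```

For example, `pick_id_field "" col1 virus strain date` prints `strain`, and `pick_id_field "" col1 virus col3 date` prints `col1`.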
Parse compressed Zika sequences into sequences and metadata.
$ ${AUGUR} parse \
> --sequences parse/zika.fasta.gz \
> --output-sequences "$TMP/sequences.fasta" \
> --output-metadata "$TMP/metadata.tsv" \
> --fields strain virus accession date region country division city db segment authors url title journal paper_url \
> --prettify-fields region country division city \
> --fix-dates monthfirst
$ diff -u "parse/sequences.fasta" "$TMP/sequences.fasta"
$ diff -u "parse/metadata.tsv" "$TMP/metadata.tsv"
$ rm -f "$TMP/sequences.fasta" "$TMP/metadata.tsv"
Error on the first duplicate.
$ cat >"$TMP/data.fasta" <<~~
> >SEQ1
> AAA
> >SEQ1
> AAA
> >SEQ2
> AAA
> >SEQ2
> AAA
> ~~
$ ${AUGUR} parse \
> --sequences "$TMP/data.fasta" \
> --output-sequences "$TMP/sequences.fasta" \
> --output-metadata "$TMP/metadata.tsv" \
> --fields strain
ERROR: Duplicate found for 'SEQ1'.
[2]
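To see which record id the duplicate check above would trip on, the first repeated header can be located directly. This is a minimal sketch, assuming each FASTA header is just the id with no other fields. `first_duplicate_id` is a hypothetical name and this is not how augur implements the check.

```shell
# Hypothetical helper (not augur code): print the first record id that appears twice.
# Returns 0 if a duplicate was found, 1 otherwise.
first_duplicate_id() {
  awk '
    /^>/ {
      id = substr($0, 2)            # drop the leading ">"
      if (seen[id]++) {             # true on the second sighting of this id
        print id
        found = 1
        exit                        # stop at the first duplicate; END still runs
      }
    }
    END { exit (found ? 0 : 1) }
  ' "$1"
}
```

Run against the four-record file above, `first_duplicate_id "$TMP/data.fasta"` prints `SEQ1` and never reaches `SEQ2`.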
Run without --fix-dates. The date is left unchanged.
$ cat >"$TMP/data.fasta" <<~~
> >SEQ1|05/01/2020
> AAA
> ~~
$ ${AUGUR} parse \
> --sequences "$TMP/data.fasta" \
> --output-sequences "$TMP/sequences.fasta" \
> --output-metadata "$TMP/metadata.tsv" \
> --fields strain date
$ cat "$TMP/metadata.tsv"
strain\tdate (esc)
SEQ1\t05/01/2020 (esc)
$ rm -f "$TMP/metadata.tsv"
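The test above shows the header `SEQ1|05/01/2020` becoming a two-column metadata row, with the values assigned to --fields in order. The sketch below reproduces that mapping, assuming `|` as the header separator (as in the test data) and no date fixing or prettifying. `header_to_tsv` is a hypothetical helper, not an augur command.

```shell
# Hypothetical helper (not augur code): turn "|"-separated FASTA headers into a TSV.
# $1 is the FASTA file; the remaining arguments are the field names, in header order.
header_to_tsv() {
  fasta="$1"
  shift
  # Header row: the field names joined with tabs.
  (IFS="$(printf '\t')"; printf '%s\n' "$*")
  # One row per record: strip the leading ">" and re-join the values with tabs.
  awk -F'|' -v OFS='\t' '/^>/ { $1 = substr($1, 2); print }' "$fasta"
}
```

Calling `header_to_tsv "$TMP/data.fasta" strain date` on the file above yields the same two tab-separated lines that `cat "$TMP/metadata.tsv"` shows.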
$ popd > /dev/null