File: filter-query-example.t

package info (click to toggle)
augur 24.4.0-3
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 25,312 kB
  • sloc: python: 14,253; sh: 227; makefile: 35
file content (63 lines) | stat: -rw-r--r-- 2,338 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
Setup

  $ source "$TESTDIR"/_setup.sh

Filter into two separate sets and then select sequences from the union of those sets.
First, select strains from Brazil (there should be 1).

  $ ${AUGUR} filter \
  >  --metadata "$TESTDIR/../data/metadata.tsv" \
  >  --query "country == 'Brazil'" \
  >  --output-strains filtered_strains.brazil.txt 2>/dev/null
  $ wc -l filtered_strains.brazil.txt
  \s*1 .* (re)

Then, select strains from Colombia (there should be 3).

  $ ${AUGUR} filter \
  >  --metadata "$TESTDIR/../data/metadata.tsv" \
  >  --query "country == 'Colombia'" \
  >  --output-strains filtered_strains.colombia.txt 2>/dev/null
  $ wc -l filtered_strains.colombia.txt
  \s*3 .* (re)

Finally, exclude all sequences except those from the two sets of strains (there should be 4).

  $ ${AUGUR} filter \
  >  --sequences "$TESTDIR/../data/sequences.fasta" \
  >  --sequence-index "$TESTDIR/../data/sequence_index.tsv" \
  >  --metadata "$TESTDIR/../data/metadata.tsv" \
  >  --exclude-all \
  >  --include filtered_strains.brazil.txt filtered_strains.colombia.txt \
  >  --output filtered.fasta 2>/dev/null
  $ grep "^>" filtered.fasta | wc -l
  \s*4 (re)

Repeat this filter without a sequence index.
We should get the same outputs without building a sequence index on the fly, because the exclude-all flag tells us we only want to force-include strains and skip all other filters.

  $ ${AUGUR} filter \
  >  --sequences "$TESTDIR/../data/sequences.fasta" \
  >  --metadata "$TESTDIR/../data/metadata.tsv" \
  >  --exclude-all \
  >  --include filtered_strains.brazil.txt filtered_strains.colombia.txt \
  >  --output filtered.fasta \
  >  --output-metadata filtered.tsv 2>/dev/null
  $ grep "^>" filtered.fasta | wc -l
  \s*4 (re)

Metadata should have the same number of records as the sequences plus a header.

  $ wc -l filtered.tsv
  \s*5 .* (re)

Alternately, exclude the sequences from Brazil and Colombia (N=4) and records without sequences (N=1) or metadata (N=1).

  $ ${AUGUR} filter \
  >  --sequences "$TESTDIR/../data/sequences.fasta" \
  >  --sequence-index "$TESTDIR/../data/sequence_index.tsv" \
  >  --metadata "$TESTDIR/../data/metadata.tsv" \
  >  --exclude filtered_strains.brazil.txt filtered_strains.colombia.txt \
  >  --output filtered.fasta 2>/dev/null
  $ grep "^>" filtered.fasta | wc -l
  \s*7 (re)