File: tutorial.md

package info (click to toggle)
seqkit 0.15.0%2Bds-2
  • links: PTS, VCS
  • area: main
  • in suites: bullseye
  • size: 67,984 kB
  • sloc: sh: 928; perl: 114; makefile: 14
file content (464 lines) | stat: -rw-r--r-- 17,691 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
# Tutorial

## Some manipulations on big genomes

A script [memusg](https://github.com/shenwei356/memusg) is
used to check the peek memory usage of seqkit. Usage: `memusg [-t] command`.

1. Human genome

        $ seqkit stat hsa.fa
        file    format  type  num_seqs        sum_len  min_len       avg_len      max_len
        hsa.fa  FASTA   DNA        194  3,099,750,718      970  15,978,096.5  248,956,422

1. Build FASTA index (***optional***, when using flag `-2` (`--two-pass`),
   some commands will automaticlly build it).
   For some commands, including `subseq`, `split`, `sort` and `shuffle`,
   when input files are (plain or gzipped) FASTA files or stdin,
   FASTA index would be optional used for
   rapid acccess of sequences and reducing memory occupation.
   ***ATTENTION***: the `.seqkit.fai` file created by SeqKit is a little different from .fai file
   created by samtools. SeqKit uses full sequence head instead of just ID as key.

        $ memusg -t seqkit faidx --id-regexp "^(.+)$"  hsa.fa -o hsa.fa.seqkit.fai

        elapsed time: 10.011s
        peak rss: 177.21 MB

    Create common .fai file:

        $ memusg -t seqkit faidx hsa.fa -o hsa.fa.fai2

        elapsed time: 10.454s
        peak rss: 172.82 MB


    Performance of samtools:

        $ memusg -t samtools faidx hsa.fa

        elapsed time: 9.574s
        peak rss: 1.45 MB

    Exactly same content:

        $ md5sum hsa.fa.fai*
        21e0c25b4d817d1c19ee8107191b9b31  hsa.fa.fai
        21e0c25b4d817d1c19ee8107191b9b31  hsa.fa.fai2

1. Sorting by sequence length

        $ memusg -t seqkit sort --by-length --reverse --two-pass hsa.fa > hsa.sorted.fa
        [INFO] create and read FASTA index ...
        [INFO] read sequence IDs and lengths from FASTA index ...
        [INFO] 194 sequences loaded
        [INFO] sorting ...
        [INFO] output ...

        elapsed time: 4.892s
        peak rss: 500.15 MB

    Detail:

        $ seqkit fx2tab --length hsa.sorted.fa --name --only-id | cut -f 1,4 | more
        1       248956422
        2       242193529
        3       198295559
        4       190214555
        5       181538259
        6       170805979
        7       159345973
        X       156040895
        8       145138636
        9       138394717
        11      135086622
        10      133797422
        12      133275309
        13      114364328
        14      107043718
        15      101991189
        16      90338345
        17      83257441
        18      80373285
        20      64444167
        19      58617616
        Y       57227415
        22      50818468
        21      46709983
        KI270728.1      1872759
        KI270727.1      448248
        ...

        real    0m10.697s
        user    0m11.153s
        sys     0m0.917s

1. Shuffling sequences

        $ memusg -t seqkit shuffle hsa.fa --two-pass > hsa.shuffled.fa
        [INFO] create and read FASTA index ...
        [INFO] read sequence IDs from FASTA index ...
        [INFO] 194 sequences loaded
        [INFO] shuffle ...
        [INFO] output ...

        elapsed time: 6.632s
        peak rss: 528.3 MB


1. Spliting into files with single sequence

        $ memusg -t seqkit split --by-id hsa.fa --two-pass
        [INFO] split by ID. idRegexp: ^([^\s]+)\s?
        [INFO] create and read FASTA index ...
        [INFO] read sequence IDs from FASTA index ...
        [INFO] 194 sequences loaded
        [INFO] write 1 sequences to file: hsa.id_KI270743.1.fa
        [INFO] write 1 sequences to file: hsa.id_KI270706.1.fa
        [INFO] write 1 sequences to file: hsa.id_KI270717.1.fa
        [INFO] write 1 sequences to file: hsa.id_KI270718.1.fa
        [INFO] write 1 sequences to file: hsa.id_KI270468.1.fa
        ...

        elapsed time: 18.807s
        peak rss: 1.36 GB

1. Geting subsequence of some chromesomes

        $ memusg -t seqkit subseq -r 1:10 --chr X --chr Y  hsa.fa
        >X_1-10 X dna_sm:chromosome chromosome:GRCh38:X:1:156040895:1 REF
        nnnnnnnnnn
        >Y_1-10 Y dna_sm:chromosome chromosome:GRCh38:Y:2781480:56887902:1 REF
        NNNNNNNNNN

        elapsed time: 1.276s
        peak rss: 640.92 MB


1. Geting CDS sequence of chr 1 by GTF files

        $ memusg -t seqkit subseq --gtf Homo_sapiens.GRCh38.84.gtf.gz --chr X --feature cds  hsa.fa > chrX.gtf.cds.fa
        [INFO] read GTF file ...
        [INFO] 22420 GTF features loaded

        elapsed time: 8.643s
        peak rss: 846.14 MB


## Remove contaminated reads

1. Mapping with reads on some potential contaminate genomes, and get the reads IDs list.

        $ wc -l contaminate.list
        244 contaminate.list

        $ head -n 2 contaminate.list
        HWI-D00523:240:HF3WGBCXX:1:1101:2574:2226
        HWI-D00523:240:HF3WGBCXX:1:1101:12616:2205

1. Remove contaminated reads

        $ seqkit grep -f contaminate.list -v reads_1.fq.gz -o reads_1.clean.fq.gz
        $ seqkit grep -f contaminate.list -v reads_2.fq.gz -o reads_2.clean.fq.gz

        $ seqkit stat *.fq.gz
        file                  seq_format   seq_type   num_seqs   min_len   avg_len   max_len
        reads_1.clean.fq.gz   FASTQ        DNA           2,256       226       227       229
        reads_1.fq.gz         FASTQ        DNA           2,500       226       227       229
        reads_2.clean.fq.gz   FASTQ        DNA           2,256       223       224       225
        reads_2.fq.gz         FASTQ        DNA           2,500       223       224       225



## Handling of aligned sequences

1. Some mock sequences (usually they will be much longer)

        $ cat seqs.fa
        >seq1
        ACAACGTCTACTTACGTTGCATCGTCATGCTGCATTACGTAGTCTGATGATG
        >seq2
        ACACCGTCTACTTTCATGCTGCATTACGTAGTCTGATGATG
        >seq3
        ACAACGTCTACTTACGTTGCATCGTCATGCTGCACTGATGATG
        >seq4
        ACAACGTCTACTTACGTTGCATCTTCGGTCATGCTGCATTACGTAGTCTGATGATG

1. Run multiple sequence alignment (clustalo)

        clustalo -i seqs.fa -o seqs.msa.fa --force --outfmt fasta --threads=4

1. Convert FASTA format to tabular format.

        $ seqkit fx2tab seqs.msa.fa
        seq1    ACAACGTCTACTTACGTTGCAT----CGTCATGCTGCATTACGTAGTCTGATGATG
        seq2    ---------------ACACCGTCTACTTTCATGCTGCATTACGTAGTCTGATGATG
        seq3    ACAACGTCTACTTACGTTGCATCGTCATGCTGCACTGATGATG-------------
        seq4    ACAACGTCTACTTACGTTGCATCTTCGGTCATGCTGCATTACGTAGTCTGATGATG

    or

        $ seqkit fx2tab seqs.msa.fa | cut -f 2
        ACAACGTCTACTTACGTTGCAT----CGTCATGCTGCATTACGTAGTCTGATGATG
        ---------------ACACCGTCTACTTTCATGCTGCATTACGTAGTCTGATGATG
        ACAACGTCTACTTACGTTGCATCGTCATGCTGCACTGATGATG-------------
        ACAACGTCTACTTACGTTGCATCTTCGGTCATGCTGCATTACGTAGTCTGATGATG

    For me, it's useful when 1) manually assembling Sanger sequencing result,
    2) designing site specific PCR primers.


1. Remove gaps

        $ seqkit seq seqs.msa.fa -g
        >seq1
        ACAACGTCTACTTACGTTGCATCGTCATGCTGCATTACGTAGTCTGATGATG
        >seq2
        ACACCGTCTACTTTCATGCTGCATTACGTAGTCTGATGATG
        >seq3
        ACAACGTCTACTTACGTTGCATCGTCATGCTGCACTGATGATG
        >seq4
        ACAACGTCTACTTACGTTGCATCTTCGGTCATGCTGCATTACGTAGTCTGATGATG



## Play with miRNA hairpins

### Dataset

[`hairpin.fa.gz`](ftp://mirbase.org/pub/mirbase/21/hairpin.fa.gz)
from [The miRBase Sequence Database -- Release 21](ftp://mirbase.org/pub/mirbase/21/)


### Quick glance

1. Sequence number

        $ seqkit stat hairpin.fa.gz
        file           format  type  num_seqs    sum_len  min_len  avg_len  max_len
        hairpin.fa.gz  FASTA   RNA     28,645  2,949,871       39      103    2,354

1. First 10 bases

        $ zcat hairpin.fa.gz \
            | seqkit subseq -r 1:10 \
            | seqkit sort -s
            | seqkit seq -s \
            | head -n 10
        AAAAAAAAAA
        AAAAAAAAAA
        AAAAAAAAAG
        AAAAAAAAAG
        AAAAAAAAAG
        AAAAAAAAAU
        AAAAAAAAGG
        AAAAAAACAU
        AAAAAAACGA
        AAAAAAAUUA

    hmm, nothing special, non-coding RNA~


### Repeated hairpin sequences

We may want to check how may identical hairpins among different species there are.
`seqkit rmdup` could remove duplicated sequences by sequence content,
and save the replicates to another file (here is `duplicated.fa.gz`),
as well as replicating details (`duplicated.detail.txt`,
1th column is the repeated number,
2nd column contains sequence IDs seperated by comma).

    $ seqkit rmdup -s -i hairpin.fa.gz -o clean.fa.gz -d duplicated.fa.gz -D duplicated.detail.txt

    $ head -n 5 duplicated.detail.txt
    18      dre-mir-430c-1, dre-mir-430c-2, dre-mir-430c-3, dre-mir-430c-4, dre-mir-430c-5, dre-mir-430c-6, dre-mir-430c-7, dre-mir-430c-8, dre-mir-430c-9, dre-mir-430c-10, dre-mir-430c-11, dre-mir-430c-12, dre-mir-430c-13, dre-mir-430c-14, dre-mir-430c-15, dre-mir-430c-16, dre-mir-430c-17, dre-mir-430c-18
    16      hsa-mir-29b-2, mmu-mir-29b-2, rno-mir-29b-2, ptr-mir-29b-2, ggo-mir-29b-2, ppy-mir-29b-2, sla-mir-29b, mne-mir-29b, ppa-mir-29b-2, bta-mir-29b-2, mml-mir-29b-2, eca-mir-29b-2, aja-mir-29b, oar-mir-29b-1, oar-mir-29b-2, rno-mir-29b-3
    15      dme-mir-125, dps-mir-125, dan-mir-125, der-mir-125, dgr-mir-125-1, dgr-mir-125-2, dmo-mir-125, dpe-mir-125-2, dpe-mir-125-1, dpe-mir-125-3, dse-mir-125, dsi-mir-125, dvi-mir-125, dwi-mir-125, dya-mir-125
    13      hsa-mir-19b-1, ggo-mir-19b-1, age-mir-19b-1, ppa-mir-19b-1, ppy-mir-19b-1, ptr-mir-19b-1, mml-mir-19b-1, sla-mir-19b-1, lla-mir-19b-1, mne-mir-19b-1, bta-mir-19b, oar-mir-19b, chi-mir-19b
    13      hsa-mir-20a, ssc-mir-20a, ggo-mir-20a, age-mir-20, ppa-mir-20, ppy-mir-20a, ptr-mir-20a, mml-mir-20a, sla-mir-20, lla-mir-20, mne-mir-20, bta-mir-20a, eca-mir-20a

The result shows the most conserved miRNAs among different species,
`mir-29b`, `mir-125`, `mir-19b-1` and `mir-20a`.
And the `dre-miR-430c` has the most multicopies in *Danio rerio*.

### Hairpins in different species

1. Before spliting by species, let's take a look at the sequence names.

        $ seqkit seq hairpin.fa.gz -n | head -n 3
        cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop
        cel-lin-4 MI0000002 Caenorhabditis elegans lin-4 stem-loop
        cel-mir-1 MI0000003 Caenorhabditis elegans miR-1 stem-loop

    The first three letters (e.g. `cel`) are the abbreviation of species names.
    So we could split hairpins by the first letters by defining custom
    sequence ID parsing regular expression `^([\w]+)\-`.

    By default, `seqkit` takes the first non-space letters as sequence ID.
    For example,

    |   FASTA head                                                  |     ID                                            |
    |:--------------------------------------------------------------|:--------------------------------------------------|
    | >123456 gene name                                             | 123456                                            |
    | >longname                                                     | longname                                          |
    | >gi|110645304|ref|NC_002516.2| Pseudomona | gi|110645304|ref|NC_002516.2| |

    But for some sequences from NCBI,
    e.g. `>gi|110645304|ref|NC_002516.2| Pseudomona`, the ID is `NC_002516.2`.
    In this case, we could set sequence ID parsing regular expression by flag
    `--id-regexp "\|([^\|]+)\| "` or just use flag `--id-ncbi`. If you want
    the `gi` number, then use `--id-regexp "^gi\|([^\|]+)\|"`.

1. Split sequences by species.
A custom ID parsing regular expression is used, `^([\w]+)\-`.

        $ seqkit split hairpin.fa.gz -i --id-regexp "^([\w]+)\-" --two-pass

    ***To reduce memory usage when splitting big file, we should always use flag `--two-pass`***

2. Species with most miRNA hairpins. Third column is the sequences number.

        $ cd hairpin.fa.gz.split/;
        $ seqkit stat hairpin.id_* \
            | csvtk space2tab \
            | csvtk -t sort -k num_seqs:nr \
            | csvtk -t pretty \
            | more
        file                     format   type   num_seqs   sum_len   min_len   avg_len   max_len
        hairpin.id_hsa.fasta     FASTA    RNA    1,881      154,242   82        82        82
        hairpin.id_mmu.fasta     FASTA    RNA    1,193      107,370   90        90        90
        hairpin.id_bta.fasta     FASTA    RNA    808        61,408    76        76        76
        hairpin.id_gga.fasta     FASTA    RNA    740        42,180    57        57        57
        hairpin.id_eca.fasta     FASTA    RNA    715        89,375    125       125       125
        hairpin.id_mtr.fasta     FASTA    RNA    672        231,840   345       345       345

    Here, a CSV/TSV tool [csvtk](https://github.com/shenwei356/csvtk)
    is used to sort and view the result.

For human miRNA hairpins

1. Length distribution.
 `seqkit fx2tab` could show extra information like sequence length, GC content.
 [`csvtk`](http://bioinf.shenwei.me/csvtk/) is used to plot.

        $ seqkit grep -r -p '^hsa' hairpin.fa.gz  \
            | seqkit fx2tab -l \
            | cut -f 4  \
            | csvtk -H plot hist --xlab Length --title "Human pre-miRNA length distribution"

    ![hairpin.id_hsa.fa.gz.lendist.png](/files/hairpin/hairpin.id_hsa.fa.gz.lendist.png)

        $ seqkit grep -r -p '^hsa' hairpin.fa.gz \
            | seqkit fx2tab -l \
            | cut -f 4 \
            | csvtk -H plot box --xlab Length --horiz --height 1.5
        
    ![hairpin.id_hsa.fa.gz.lenbox.png](/files/hairpin/hairpin.id_hsa.fa.gz.lenbox.png)

## Bacteria genome

### Dataset

[Pseudomonas aeruginosa PAO1](http://www.ncbi.nlm.nih.gov/nuccore/110645304),
files:

-  Genbank file [`PAO1.gb`](/files/PAO1/PAO1.gb)
-  Genome FASTA file [`PAO1.fasta`](/files/PAO1/PAO1.fasta)
-  GTF file [`PAO1.gtf`](/files/PAO1/PAO1.gtf) was created with [`extract_features_from_genbank_file.py`](https://github.com/shenwei356/bio_scripts/blob/master/file_formats/extract_features_from_genbank_file.py), by

        extract_features_from_genbank_file.py  PAO1.gb -t . -f gtf > PAO1.gtf


### Motif distribution

Motifs

    $ cat motifs.fa
    >GTAGCGS
    GTAGCGS
    >GGWGKTCG
    GGWGKTCG

1. Sliding. Remember flag `--id-ncbi`, do you?
  By the way, do not be scared by the long flag `--circle-genome`, `--step`
  and so on. They have short ones, `-c`, `-s`

        $ seqkit sliding --id-ncbi --circular-genome \
            --step 20000 --window 200000 PAO1.fasta -o PAO1.fasta.sliding.fa

        $ seqkit stat PAO1.fasta.sliding.fa
        file                   format  type  num_seqs     sum_len  min_len  avg_len  max_len
        PAO1.fasta.sliding.fa  FASTA   DNA        314  62,800,000  200,000  200,000  200,000

1. Locating motifs

        $ seqkit locate --id-ncbi --ignore-case --degenerate \
            --pattern-file motifs.fa  PAO1.fasta.sliding.fa -o  PAO1.fasta.sliding.fa.motifs.tsv

1. Ploting distribution ([plot_motif_distribution.R](/files/PAO1/plot_motif_distribution.R))

        # preproccess
        $ perl -ne 'if (/_sliding:(\d+)-(\d+)\t(.+)/) {$loc= $1 + 100000; print "$loc\t$3\n";} else {print}' PAO1.fasta.sliding.fa.motifs.tsv  > PAO1.fasta.sliding.fa.motifs.tsv2

        # plot
        $ ./plot_motif_distribution.R

    Result

    ![motif_distribution.png](files/PAO1/motif_distribution.png)


### Find multicopy genes

1. Get all CDS sequences

        $ seqkit subseq --id-ncbi --gtf PAO1.gtf --feature cds PAO1.fasta -o PAO1.cds.fasta

        $ seqkit stat *.fasta
        file            format  type  num_seqs    sum_len    min_len    avg_len    max_len
        PAO1.cds.fasta  FASTA   DNA      5,572  5,593,306         72    1,003.8     16,884
        PAO1.fasta      FASTA   DNA          1  6,264,404  6,264,404  6,264,404  6,264,404

1. Get duplicated sequences

        $ seqkit rmdup --by-seq --ignore-case PAO1.cds.fasta -o PAO1.cds.uniq.fasta \
            --dup-seqs-file PAO1.cds.dup.fasta --dup-num-file PAO1.cds.dup.text

        $ cat PAO1.cds.dup.text
        6       NC_002516.2_500104:501120:-, NC_002516.2_2556948:2557964:+, NC_002516.2_3043750:3044766:-, NC_002516.2_3842274:3843290:-, NC_002516.2_4473623:4474639:+, NC_002516.2_5382796:5383812:-
        2       NC_002516.2_2073555:2075438:+, NC_002516.2_4716660:4718543:+
        2       NC_002516.2_2072935:2073558:+, NC_002516.2_4716040:4716663:+
        2       NC_002516.2_2075452:2076288:+, NC_002516.2_4718557:4719393:+

### Flanking sequences

1. Get CDS and 1000 bp upstream sequence

        $ seqkit subseq --id-ncbi --gtf PAO1.gtf \
            --feature cds PAO1.fasta --up-stream 1000

1. Get 1000 bp upstream sequence of CDS, *NOT* including CDS.

        $ seqkit subseq --id-ncbi --gtf PAO1.gtf \
            --feature cds PAO1.fasta --up-stream 1000 --only-flank

<div id="disqus_thread"></div>
<script>

/**
*  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
*  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables*/
/*
var disqus_config = function () {
this.page.url = PAGE_URL;  // Replace PAGE_URL with your page's canonical URL variable
this.page.identifier = PAGE_IDENTIFIER; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
};
*/
(function() { // DON'T EDIT BELOW THIS LINE
var d = document, s = d.createElement('script');
s.src = '//seqkit.disqus.com/embed.js';
s.setAttribute('data-timestamp', +new Date());
(d.head || d.body).appendChild(s);
})();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>