File: debug.rst

package info (click to toggle)
nanopolish 0.14.0-2
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 11,760 kB
  • sloc: cpp: 22,200; ansic: 1,478; python: 814; makefile: 210; sh: 43; perl: 17
file content (170 lines) | stat: -rw-r--r-- 6,374 bytes parent folder | download | duplicates (3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
.. _help_us_debug:

Helping us debug nanopolish
===============================

Overview
"""""""""""""""""""""""

Running into errors with nanopolish? To help us debug, we need to be able to reproduce the errors. We can do this by packaging a subset of the files that were used by a nanopolish. We have provided ``scripts/extract_reads_aligned_to_region.py`` and this tutorial to help you do exactly this.

Briefly, this script will:

* extract reads that align to a given region in the draft genome assembly
* rewrite a new BAM, BAI, FASTA file with these reads
* extract the FAST5 files associated with these reads
* save all these files into a tar.gz file

Workflow
"""""""""""""

#. Narrow down a problematic region by running ``nanopolish variants --consensus [...]`` and changing the ``-w`` parameter.
#. Run the ``scripts/extract_reads_aligned_to_region.py``.
#. Send the resulting ``tar.gz`` file to us by hosting either a dropbox or google drive.

.. _creating_example_dataset:

Tutorial - using extraction helper script to create example datsets
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

We extracted a subset of reads for a 2kb region to create the example dataset for the eventalign and consensus tutorial using ``scripts/extract_reads_aligned_to_region.py``. Here is how:

|

Generated basecalled ``--reads`` file:

#. Basecalled reads with albacore: ::

    read_fast5_basecaller.py -c r94_450bps_linear.cfg -t 8 -i /path/to/raw/fast5/files -s /path/to/albacore-2.0.1/output/directory -o fastq 

#. Merged the different albacore fastq outputs: ::

    cat /path/to/albacore-2.0.1/output/directory/workspace/pass/*.fastq > albacore-2.0.1-merged.fastq

#. Converted the merged fastq to fasta format: ::

    paste - - - - < albacore-2.0.1-merged.fastq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > reads.fasta

|

Generated ``--bam`` file with the draft genome assembly (``-g``):

#. Ran canu to create draft genome assembly: ::

    canu \
        -p ecoli -d outdir genomeSize=4.6m \
        -nanopore-raw reads.fasta \ 

#. Index draft assembly: ::

    bwa index ecoli.contigs.fasta
    samtools faidx ecoli.contigs.fasta

#. Aligned reads to draft genome assembly with bwa (v0.7.12): ::

    bwa mem -x ont2d ecoli.contigs.fasta reads.fasta | samtools sort -o reads.sorted.bam -T reads.tmp
    samtools index reads.sorted.bam

|

Selected a ``--window``:

#. Identified the first contig name and chose a random start position: ::

    head -3 ecoli.contigs.fasta

Output: ::

    >tig00000001 len=4376233 reads=23096 covStat=7751.73 gappedBases=no class=contig suggestRepeat=no suggestCircular=no
    AGATGCTTTGAAAGAAACGCAGAATAGATCTCTATGTAATGATATGGAATACTCTGGTATTGTCTGTAAAGATACTAATGGAAAATATTTTGCATCTAAG
    GCAGAAACTGATAATTTAAGAAAGGAGTCATATCCTCTGAAAAGAAAATGTCCCACAGGTACAGATAGAGTTGCTGCTTATCATACTCACGGTGCAGATA
 
As we wanted a 2kb region, we selected a random start position (200000) so our end position was 202000. Therefore our ``--window`` was "tig00000001:200000-202000".

|

Using the files we created, we ran ``scripts/extract_reads_aligned_to_region.py``, please see usage example below.

.. note:: Make sure nanopolish still reproduces the same error on this subset before sending it to us.

Usage example
"""""""""""""""""""""""
::

    python3 extract_reads_aligned_to_region.py \
        --reads reads.fasta \
        --genome ecoli.contigs.fasta \
        --bam reads.sorted.bam \
        --window "tig00000001:200000-202000" \
        --output_prefix ecoli_2kb_region --verbose

.. list-table:: 
   :widths: 25 5 20 50
   :header-rows: 1

   * - Argument name(s)
     - Req.
     - Default value
     - Description

   * - ``-b``, ``--bam``
     - Y
     - NA
     - Sorted bam file created by aligning reads to the draft genome.

   * - ``-g``, ``--genome``
     - Y
     - NA
     - Draft genome assembly

   * - ``-r``, ``--reads``
     - Y
     - NA
     - Fasta, fastq, fasta.gz, or fastq.gz file containing basecalled reads.

   * - ``-w``, ``--window``
     - Y
     - NA
     - Draft genome assembly coordinate string ex. "contig:start-end". It is essential that you wrap the coordinates in quotation marks (\").

   * - ``-o``, ``--output_prefix``
     - N
     - reads_subset
     - Prefix of output tar.gz and log file.

   * - ``-v``, ``--verbose``
     - N
     - False
     - Use for verbose output with info on progress.

Script overview
"""""""""""""""""""""

#. Parse input files
#. Assumes readdb file name from input reads file
#. Validates input
    - checks that input bam, readdb, fasta/q, draft genome assembly, draft genome assembly index file exist, are not empy, and are readable
#. With user input draft genome assembly coordinates, extracts all reads that aligned within these coordinates stores the read_ids. This information can be found from the input BAM.
    - uses pysam.AlignmentFile
    - uses samfile.fetch(region=draft_ga_coords) to get all reads aligned to region
    - if reads map to multiple sections within draft ga it is not added again
#. Parses through the input readdb file to find the FAST5 files associated with each region that aligned to region
    - stores in dictionary region_fast5_files; key = read_id, value = path/to/fast5/file
    - path to fast5 file is currently dependent on the user's directory structure
#. Make a BAM and BAI file for this specific region
    - creates a new BAM file called ``region.bam``
    - with pysam.view we rewrite the new bam with reads that aligned to the region...
    - with pysam.index we create a new BAI file
#. Now to make a new FASTA file with this subset of reads
    - the new fasta file is called ``region.fasta``
    - this first checks what type of sequences file is given { ``fasta``, ``fastq``, ``fasta.gz``, ``fastq.gz`` }
    - then handles based on type of seq file using SeqIO.parse
    - then writes to a new fasta file
#. Let's get to tarring
    - creates a ``tar.gz`` file with the output prefix
    - saves the fast5 files in directory ``output_prefix/fast5_files/``
#. Adds the new fasta, new bam, and new bai file with the subset of reads
#. Adds the draft genome asssembly and associated fai index file
#. Performs a check
    - the number of reads in the new BAM file, new FASTA file, and the number of files in the fast5_files directory should be equal
#. Outputs a ``tar.gz`` and ``log`` file. FIN!