File: recipes.rst

package info (click to toggle)
python-cutadapt 3.2-2
  • links: PTS, VCS
  • area: main
  • in suites: bullseye, sid
  • size: 1,724 kB
  • sloc: python: 6,028; makefile: 167; sh: 27
file content (301 lines) | stat: -rw-r--r-- 11,821 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
===============
Recipes and FAQ
===============

This section gives answers to frequently asked questions. It shows you how to
get Cutadapt to do what you want it to do!


Remove more than one adapter
----------------------------

If you want to remove a 5' and 3' adapter at the same time, :ref:`use the
support for linked adapters <linked-adapters>`.

If your situation is different, for example, when you have many 5' adapters
but only one 3' adapter, then you have two options.

First, you can specify the adapters and also ``--times=2`` (or the short
version ``-n 2``). For example::

    cutadapt -g ^TTAAGGCC -g ^AAGCTTA -a TACGGACT -n 2 -o output.fastq input.fastq

This instructs Cutadapt to run two rounds of adapter finding and removal. That
means that, after the first round and only when an adapter was actually found,
another round is performed. In both rounds, all given adapters are searched and
removed. The problem is that it could happen that one adapter is found twice (so
the 3' adapter, for example, could be removed twice).

The second option is to not use the ``-n`` option, but to run Cutadapt twice,
first removing one adapter and then the other. It is easiest if you use a pipe
as in this example::

    cutadapt -g ^TTAAGGCC -g ^AAGCTTA input.fastq | cutadapt -a TACGGACT - > output.fastq


Trim poly-A tails
-----------------

If you want to trim a poly-A tail from the 3' end of your reads, use the 3'
adapter type (``-a``) with an adapter sequence of many repeated ``A``
nucleotides. Starting with version 1.8 of Cutadapt, you can use the
following notation to specify a sequence that consists of 100 ``A``::

    cutadapt -a "A{100}" -o output.fastq input.fastq

This also works when there are sequencing errors in the poly-A tail. So this
read ::

    TACGTACGTACGTACGAAATAAAAAAAAAAA

will be trimmed to::

    TACGTACGTACGTACG

If for some reason you would like to use a shorter sequence of ``A``, you can
do so: The matching algorithm always picks the leftmost match that it can find,
so Cutadapt will do the right thing even when the tail has more ``A`` than you
used in the adapter sequence. However, sequencing errors may result in shorter
matches than desired. For example, using ``-a "A{10}"``, the read above (where
the ``AAAT`` is followed by eleven ``A``) would be trimmed to::

    TACGTACGTACGTACGAAAT

Depending on your application, perhaps a variant of ``-a A{10}N{90}`` is an
alternative, forcing the match to be located as much to the left as possible,
while still allowing for non-``A`` bases towards the end of the read.


Trim a fixed number of bases after adapter trimming
---------------------------------------------------

If the adapters you want to remove are preceded by some unknown sequence (such
as a random tag/molecular identifier), you can specify this as part of the
adapter sequence in order to remove both in one go.

For example, assume you want to trim Illumina adapters preceded by 10 bases
that you want to trim as well. Instead of this command::

    cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC ...

Use this command::

    cutadapt -O 13 -a N{10}AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC ...

The ``-O 13`` is the minimum overlap for an adapter match, where the 13 is
computed as 3 plus 10 (where 3 is the default minimum overlap and 10 is the
length of the unknown section). If you do not specify it, the adapter sequence
would match the end of every read (because ``N`` matches anything), and ten
bases would then be removed from every read.


Trimming (amplicon-) primers from both ends of paired-end reads
---------------------------------------------------------------

If you want to remove primer sequences that flank your sequence of
interest, you should use a :ref:`"linked adapter" <linked-adapters>`
to remove them. If you have paired-end data (with R1 and R2), you
can correctly trim both R1 and R2 by using linked adapters for both
R1 and R2. Here is how to do this.

The full DNA fragment that is put on the sequencer looks like this
(looking only at the forward strand):

   5' sequencing primer -- forward primer -- sequence of interest -- reverse complement of reverse primer -- reverse complement of 3' sequencing primer

Since sequencing of R1 starts after the 5' sequencing primer, R1 will
start with the forward primer and then continue into the sequence of
interest and into the two primers to the right of it, depending on
the read length and how long the sequence of interest is. For R1,
the linked adapter option that needs to be used is therefore ::

    -a FWDPRIMER...RCREVPRIMER

where ``FWDPRIMER`` needs to be replaced with the sequence of your
forward primer and ``RCREVPRIMER`` with the reverse complement of
the reverse primer. The three dots ``...`` need to be entered
as they are -- they tell Cutadapt that this is a linked adapter
with a 5' and a 3' part.

Sequencing of R2 starts before the 3' sequencing primer and
proceeds along the reverse-complementary strand. For the correct
linked adapter, the sequences from above therefore need to be
swapped and reverse-complemented::

    -A REVPRIMER...RCFWDPRIMER

The uppercase ``-A`` specifies that this option is
meant to work on R2. Similar to above, ``REVPRIMER`` is
the sequence of the reverse primer and ``RCFWDPRIMER`` is the
reverse-complement of the forward primer. Note that Cutadapt
does not reverse-complement any sequences of its own; you
will have to do that yourself.

Finally, you may want to filter the trimmed read pairs.
Use ``--discard-untrimmed`` to throw away all read pairs in
which R1 doesn’t start with ``FWDPRIMER`` or in which R2
does not start with ``REVPRIMER``.

A note on how the filtering works: In linked adapters, by default
the first part (before the ``...``) is anchored. Anchored
sequences *must* occur. If they don’t, then the other sequence
(after the ``...``) is not even searched for and the entire
read is internally marked as “untrimmed”. This is done for both
R1 and R2 and as soon as *any* of them is marked as “untrimmed”,
the entire pair is considered to be “untrimmed”. If
``--discard-untrimmed`` is used, this means that the entire
pair is discarded if R1 or R2 are untrimmed. (Option
``--pair-filter=both`` can be used to change this to require
that *both* were marked as untrimmed.)

In summary, this is how to trim your data and discard all
read pairs that do not contain the primer sequences that
you know must be there::

    cutadapt -a FWDPRIMER...RCREVPRIMER -A REVPRIMER...RCFWDPRIMER --discard-untrimmed -o out.1.fastq.gz -p out.2.fastq.gz in.1.fastq.gz in.2.fastq.gz


Piping paired-end data
----------------------

Sometimes it is necessary to run Cutadapt twice on your data. For example, when
you want to change the order in which read modification or filtering options are
applied. To simplify this, you can use Unix pipes (``|``), but this is more
difficult with paired-end data since then input and output consists of two files
each.

The solution is to interleave the paired-end data, send it over the pipe
and then de-interleave it in the other process. Here is how this looks in
principle::

    cutadapt [options] --interleaved in.1.fastq.gz in.2.fastq.gz | \
      cutadapt [options] --interleaved -o out.1.fastq.gz -p out.2.fastq.gz -

Note the ``-`` character in the second invocation to Cutadapt.


Support for concatenated compressed files
-----------------------------------------

Cutadapt supports concatenated gzip and bzip2 input files.


Paired-end read name check
--------------------------

When reading paired-end files, Cutadapt checks whether the read names match.
Only the part of the read name before the first space is considered. If the
read name ends with ``1`` or ``2``, then that is also ignored. For example,
two FASTQ headers that would be considered to denote properly paired reads are::

    @my_read/1 a comment

and::

    @my_read/2 another comment

This is an example for *improperly paired* read names::

    @my_read/1;1

and::

    @my_read/2;1

Since the ``1`` and ``2`` are ignored only if the occur at the end of the read
name, and since the ``;1`` is considered to be part of the read name, these
reads will not be considered to be propely paired.


Rescuing single reads from paired-end reads that were filtered
--------------------------------------------------------------

When trimming and filtering paired-end reads, Cutadapt always discards entire read pairs. If you
want to keep one of the reads, you need to write the filtered read pairs to an output file and
postprocess it.

For example, assume you are using ``-m 30`` to discard too short reads. Cutadapt discards all
read pairs in which just one of the reads is too short (but see the ``--pair-filter`` option).
To recover those (individual) reads that are long enough, you can first use the
``--too-short-(paired)-output`` options to write the filtered pairs to a file, and then postprocess
those files to keep only the long enough reads.


    cutadapt -m 30 -q 20 -o out.1.fastq.gz -p out.2.fastq.gz --too-short-output=tooshort.1.fastq.gz --too-short-paired-output=tooshort.2.fastq.gz in.1.fastq.gz in.2.fastq.gz
    cutadapt -m 30 -o rescued.a.fastq.gz tooshort.1.fastq.gz
    cutadapt -m 30 -o rescued.b.fastq.gz tooshort.2.fastq.gz

The two output files ``rescued.a.fastq.gz`` and ``rescued.b.fastq.gz`` contain those individual
reads that are long enough. Note that the file names do not end in ``.1.fastq.gz`` and
``.2.fastq.gz`` to make it very clear that these files no longer contain synchronized paired-end
reads.


.. _bisulfite:

Bisulfite sequencing (RRBS)
---------------------------

When trimming reads that come from a library prepared with the RRBS (reduced
representation bisulfite sequencing) protocol, the last two 3' bases must be
removed in addition to the adapter itself. This can be achieved by using not
the adapter sequence itself, but by adding two wildcard characters to its
beginning. If the adapter sequence is ``ADAPTER``, the command for trimming
should be::

    cutadapt -a NNADAPTER -o output.fastq input.fastq

Details can be found in `Babraham bioinformatics' "Brief guide to
RRBS" <http://www.bioinformatics.babraham.ac.uk/projects/bismark/RRBS_Guide.pdf>`_.
A summary follows.

During RRBS library preparation, DNA is digested with the restriction enzyme
MspI, generating a two-base overhang on the 5' end (``CG``). MspI recognizes
the sequence ``CCGG`` and cuts
between ``C`` and ``CGG``. A double-stranded DNA fragment is cut in this way::

    5'-NNNC|CGGNNN-3'
    3'-NNNGGC|CNNN-5'

The fragment between two MspI restriction sites looks like this::

    5'-CGGNNN...NNNC-3'
      3'-CNNN...NNNGGC-5'

Before sequencing (or PCR) adapters can be ligated, the missing base positions
must be filled in with GTP and CTP::

    5'-ADAPTER-CGGNNN...NNNCcg-ADAPTER-3'
    3'-ADAPTER-gcCNNN...NNNGGC-ADAPTER-5'

The filled-in bases, marked in lowercase above, do not contain any original
methylation information, and must therefore not be used for methylation calling.
By prefixing the adapter sequence with ``NN``, the bases will be automatically
stripped during adapter trimming.


.. _file-format-conversion:

File format conversion
----------------------

You can use Cutadapt to convert FASTQ to FASTA format::

    cutadapt -o output.fasta.gz input.fastq.gz

Cutadapt detects that the file name extension of the output file is ``.fasta``
and writes in FASTA format, omitting the qualities.

When writing to standard output, you need to use the ``--fasta`` option::

    cutadapt --fasta input.fastq.gz > out.fasta

Without the option, Cutadapt writes in FASTQ format.


Other things (unfinished)
-------------------------

* How to detect adapters
* Use Cutadapt for quality-trimming only
* Use it for minimum/maximum length filtering