File: maf-convert.rst

package info (click to toggle)
last-align 1651-2
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 14,692 kB
  • sloc: cpp: 44,419; python: 5,217; ansic: 1,938; sh: 710; makefile: 457
file content (146 lines) | stat: -rw-r--r-- 5,554 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
maf-convert
===========

This script reads alignments in maf_ format, and writes them in
another format.  It can write them in these formats: axt_, bed_,
blast, blasttab, blasttab+, chain_, gff, html, psl_, sam, tab.  You
can use it like this::

  maf-convert psl my-alignments.maf > my-alignments.psl

It's often convenient to pipe in the input, like this::

  ... | maf-convert psl > my-alignments.psl

This script takes the first (topmost) maf sequence as the "reference"
/ "subject", and the second sequence as the "query".
(Exception: when converting DNA-to-protein alignments to bed, gff or psl,
the protein becomes the "query" and the DNA becomes the "reference".)

For html: if the input includes probability lines starting with 'p',
then the output will be colored by column probability.  (To get lines
starting with 'p', run lastal with option -j set to 4 or higher.)

.. _maf: http://genome.ucsc.edu/FAQ/FAQformat.html#format5
.. _axt: https://genome.ucsc.edu/goldenPath/help/axt.html
.. _bed: https://genome.ucsc.edu/FAQ/FAQformat.html#format1
.. _chain: https://genome.ucsc.edu/goldenPath/help/chain.html
.. _psl: https://genome.ucsc.edu/FAQ/FAQformat.html#format2

Options
-------

-h, --help
       Print a help message and exit.

-s N, --subject=N
       Use the Nth sequence in each alignment as the "reference" /
       "subject".  This option affects these formats: bed, blast,
       blasttab, gff, psl.

-p, --protein
       Specify that the alignments are of proteins, rather than
       nucleotides.  This affects psl format only (the first 4
       columns), and has no effect for DNA-to-protein alignments.

-j N, --join=N
       Join alignments that are co-linear (align different parts of
       the same sequences and strands, with the parts being in the
       same order in each sequence), are separated by at most N
       letters in each sequence, and are consecutive in the input.
       This affects bed, gff, psl, and sam formats only.

-J N, --Join=N
       Join alignments that are co-linear, are separated by at most
       N letters in each sequence, and are nearest in each sequence.
       This affects bed, gff and psl formats only, and reads the whole
       input into memory.

-n, --noheader
       Omit any header lines from the output.  This may be useful if
       you concatenate outputs, e.g. from parallel jobs.

-d, --dictionary
       Include a dictionary of sequence lengths in the sam header
       section (lines starting with @SQ).  This requires reading the
       input twice, so it must be a real file (not a pipe).  This
       affects sam format only.

-f DICTFILE, --dictfile=DICTFILE
       Get a sequence dictionary from DICTFILE.  This affects sam
       format only.  You can create a dict file using
       CreateSequenceDictionary (http://picard.sourceforge.net/).

-r READGROUP, --readgroup=READGROUP
       Specify read group information.  This affects sam format
       only.  Example: -r 'ID:1 PL:ILLUMINA SM:mysample'

-l CHARS, --linesize=CHARS
       Write CHARS characters per line.  This affects blast and html
       formats only.

DNA-versus-protein blast output
-------------------------------

If the input has protein-versus-DNA alignments like this::

  GluTrpThrAlaLeuIleAsnLeuLysAsnArg--AspLeuValIleLysAlaAlaAsp
  GAATAGTCCGGTTGAAAAAATGTACAAAAACATAAGAGAACATTACAAAACTTGCAGTC

then the blast-format output will look like this::

  Glu***SerGly***LysAsnValGlnLysHis  GluAsnIleThrLysLeuAlaVal
  GAATAGTCCGGTTGAAAAAATGTACAAAAACATAAGAGAACATTACAAAACTTGCAGTC
  |||::::::         |||...:::...:::  :::   :::...|||   |||
  GluTrpThrAlaLeuIleAsnLeuLysAsnArg--AspLeuValIleLysAlaAlaAsp

The DNA's translation (assuming the standard genetic code) is shown
above it.  ``|||`` indicates a match, ``:::`` a positive alignment
score, and ``...`` an alignment score of 0.  ``:::`` and ``...`` are
shown only for alignments preceded by a substitution score matrix of
the sort in lastal's output header.

Hints for sam/bam
-----------------

* To run fast on multiple CPUs, and get a correct header at the top,
  this may be the least-awkward way.  First, make a header (perhaps by
  using CreateSequenceDictionary).  Then, concatenate the output of a
  command like this::

    parallel-fastq "... | maf-convert -n sam" < q.fastq

* Here is yet another way to get a sequence dictionary, using samtools
  (http://samtools.sourceforge.net/).  Assume the reference sequences
  are in ref.fa.  These commands convert x.sam to y.bam while adding a
  sequence dictionary::

    samtools faidx ref.fa
    samtools view -bt ref.fa.fai x.sam > y.bam

* If a query name ends in "/1" or "/2", maf-convert interprets it as a
  paired sequence.  (This affects sam format only.)  However, it does
  not calculate all of the sam pairing information (because it's hard
  and better done by specialized sam manipulators).

  Fix the pair information in y.sam, putting the output in z.bam.
  Using picard::

    java -jar FixMateInformation.jar I=y.sam O=z.bam VALIDATION_STRINGENCY=SILENT

  Alternatively, using samtools::

    samtools sort -n y.bam ysorted
    samtools fixmate ysorted.bam z.bam

"Bugs"
------

* For sam: the QUAL field (column 11) is simply copied from the maf q
  line.  The QUAL is supposed to be encoded as ASCII(phred+33),
  whereas maf q lines are encoded differently according to the UCSC
  Genome FAQ.  However, if you run lastal with option -Q1, the maf q
  lines will in fact be ASCII(phred+33).

* The blast format is merely blast-like: it is not identical to NCBI
  BLAST.