File: anib.py

Package: python-pyani 0.2.13-1
# -*- coding: utf-8 -*-
"""Code to implement the ANIb average nucleotide identity method.

Calculates ANI by the ANIb method, as described in Goris et al. (2007)
Int J Syst Evol Micr 57: 81-91. doi:10.1099/ijs.0.64483-0.

From Goris et al.

'''The genomic sequence from one of the genomes in a pair (the query)
was cut into consecutive 1020 nt fragments. The 1020 nt cut-off was used
to correspond with the fragmentation of the genomic DNA to approximately
1 kb fragments during the DDH experiments. [...] The 1020 nt fragments
were then used to search against the whole genomic sequence of the other
genome in the pair (the reference) by using the BLASTN algorithm;
the best BLASTN match was saved for further analysis. The BLAST
algorithm was run using the following settings: X=150 (where X is the
drop-off value for gapped alignment), q=-1 (where q is the penalty
for nucleotide mismatch) and F=F (where F is the filter for repeated
sequences); the rest of the parameters were used at the default settings.
These settings give better sensitivity than the default settings when
more distantly related genomes are being compared, as the latter
target sequences that are more similar to each other.
[...]
The ANI between the query genome and the reference genome was
calculated as the mean identity of all BLASTN matches that showed more
than 30% overall sequence identity (recalculated to an identity along
the entire sequence) over an alignable region of at least 70% of their
length. This cut-off is above the 'twilight zone' of similarity searches in
which an inference of homology is error prone because of low levels of
similarity between aligned sequences.
Reverse searching, i.e. in which the reference genome is used as the
query, was also performed to provide reciprocal values.'''

All input FASTA format files are used to construct BLAST databases.
Each file's contents are also split into sequence fragments of length
options.fragsize, and the resulting multi-FASTA file is written to the
output directory. These fragment files are BLASTNed, pairwise, against the
databases.

BLAST output is interrogated for all fragment matches that cover
at least 70% of the query sequence, with at least 30% nucleotide
identity over the full length of the query sequence. This is an odd
choice and doesn't correspond to the twilight zone limit as implied by
Goris et al. We persist with their definition, however. Only these
qualifying matches contribute to the total aligned length and total
aligned sequence identity used to calculate ANI.

(c) The James Hutton Institute 2013-2018
Author: Leighton Pritchard

Contact:
leighton.pritchard@hutton.ac.uk

Leighton Pritchard,
Information and Computing Sciences,
James Hutton Institute,
Errol Road,
Invergowrie,
Dundee,
DD6 9LH,
Scotland,
UK

The MIT License

Copyright (c) 2013-2018 The James Hutton Institute

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
"""

import os
import shutil

import pandas as pd

from Bio import SeqIO

from . import pyani_config
from . import pyani_files
from . import pyani_jobs
from .pyani_tools import ANIResults, BLASTcmds, BLASTexes, BLASTfunctions
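
# A minimal sketch of how these functions are typically chained by the pyani
# ANIb front end (names and paths below are illustrative; job scheduling is
# omitted, see run_multiprocessing.py / run_sge.py):
#
#     fragfiles, fraglengths = fragment_fasta_files(infiles, outdir, 1020)
#     blastcmds = make_blastcmd_builder("ANIb", outdir)
#     jobgraph = make_job_graph(infiles, fragfiles, blastcmds)
#     # ...run the jobs with a scheduler, then...
#     results = process_blast(outdir, org_lengths, fraglengths, mode="ANIb")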


# Divide input FASTA sequences into fragments
def fragment_fasta_files(infiles, outdirname, fragsize):
    """Chops sequences of the passed files into fragments, returns filenames.

    - infiles - paths to each input sequence file
    - outdirname - path to output directory
    - fragsize - the size of sequence fragments

    Takes every sequence from every file in infiles and splits it into
    consecutive fragments of length fragsize (any trailing sequence shorter
    than fragsize is retained), writing the resulting set of fragments to a
    file named <stem>-fragments<ext> in the output directory. Within each
    output file, fragments are named consecutively and uniquely as fragNNNNN;
    sequence description fields are retained.
    """
    outfnames = []
    for fname in infiles:
        outstem, outext = os.path.splitext(os.path.split(fname)[-1])
        outfname = os.path.join(outdirname, outstem) + "-fragments" + outext
        outseqs = []
        count = 0
        for seq in SeqIO.parse(fname, "fasta"):
            idx = 0
            while idx < len(seq):
                count += 1
                newseq = seq[idx : idx + fragsize]
                newseq.id = "frag%05d" % count
                outseqs.append(newseq)
                idx += fragsize
        outfnames.append(outfname)
        SeqIO.write(outseqs, outfname, "fasta")
    return outfnames, get_fraglength_dict(outfnames)
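
# Example (illustrative sketch; the paths below are hypothetical):
#
#     fragfiles, fraglengths = fragment_fasta_files(
#         ["genomes/NC_002696.fna", "genomes/NC_011916.fna"],
#         "anib_output",
#         1020,
#     )
#     # fragfiles -> ["anib_output/NC_002696-fragments.fna",
#     #               "anib_output/NC_011916-fragments.fna"]
#     # fraglengths maps each query name to a {fragment ID: length} dict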


# Get lengths of all sequences in all files
def get_fraglength_dict(fastafiles):
    """Returns dictionary of sequence fragment lengths, keyed by query name.

    - fastafiles - list of FASTA input whole sequence files

    Loops over input files and, for each, produces a dictionary with fragment
    lengths, keyed by sequence ID. These are returned as a dictionary with
    the keys being query IDs derived from filenames.
    """
    fraglength_dict = {}
    for filename in fastafiles:
        qname = os.path.split(filename)[-1].split("-fragments")[0]
        fraglength_dict[qname] = get_fragment_lengths(filename)
    return fraglength_dict


# Get lengths of all sequences in a file
def get_fragment_lengths(fastafile):
    """Returns dictionary of sequence fragment lengths, keyed by fragment ID.

    Biopython's SeqIO module is used to parse all sequences in the FASTA
    file.

    NOTE: ambiguity symbols are not discounted.
    """
    fraglengths = {}
    for seq in SeqIO.parse(fastafile, "fasta"):
        fraglengths[seq.id] = len(seq)
    return fraglengths
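
# Example of the structure returned by the two helpers above (lengths are
# illustrative; the final fragment of each sequence may be shorter than
# fragsize):
#
#     get_fraglength_dict(["anib_output/NC_002696-fragments.fna"])
#     # -> {"NC_002696": {"frag00001": 1020, "frag00002": 1020, ...,
#     #                   "frag03956": 727}}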


# Create dictionary of database building commands, keyed by dbname
def build_db_jobs(infiles, blastcmds):
    """Returns dictionary of db-building commands, keyed by dbname."""
    dbjobdict = {}  # Dict of database construction jobs, keyed by filename
    # Create dictionary of database building jobs, keyed by db name
    # defining jobnum for later use as last job index used
    for idx, fname in enumerate(infiles):
        dbjobdict[blastcmds.get_db_name(fname)] = pyani_jobs.Job(
            "%s_db_%06d" % (blastcmds.prefix, idx), blastcmds.build_db_cmd(fname)
        )
    return dbjobdict
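
# Example (sketch): one database-construction Job is created per input
# genome, keyed by the database name that BLASTcmds.get_db_name() derives
# from the input filename. With the default prefix "ANIBLAST", job names
# follow ANIBLAST_db_000000, ANIBLAST_db_000001, ...:
#
#     blastcmds = make_blastcmd_builder("ANIb", "anib_output")  # see below
#     dbjobs = build_db_jobs(["genomes/NC_002696.fna"], blastcmds)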


def make_blastcmd_builder(
    mode, outdir, format_exe=None, blast_exe=None, prefix="ANIBLAST"
):
    """Returns BLASTcmds object for construction of BLAST commands."""
    if mode == "ANIb":  # BLAST/formatting executable depends on mode
        blastcmds = BLASTcmds(
            BLASTfunctions(construct_makeblastdb_cmd, construct_blastn_cmdline),
            BLASTexes(
                format_exe or pyani_config.MAKEBLASTDB_DEFAULT,
                blast_exe or pyani_config.BLASTN_DEFAULT,
            ),
            prefix,
            outdir,
        )
    else:
        blastcmds = BLASTcmds(
            BLASTfunctions(construct_formatdb_cmd, construct_blastall_cmdline),
            BLASTexes(
                format_exe or pyani_config.FORMATDB_DEFAULT,
                blast_exe or pyani_config.BLASTALL_DEFAULT,
            ),
            prefix,
            outdir,
        )
    return blastcmds
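
# Example (sketch): a BLASTcmds helper for the legacy ANIblastall mode,
# overriding the executable locations (the paths here are hypothetical):
#
#     blastcmds = make_blastcmd_builder(
#         "ANIblastall",
#         "aniblastall_output",
#         format_exe="/opt/blast2/bin/formatdb",
#         blast_exe="/opt/blast2/bin/blastall",
#     )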


# Make a dependency graph of BLAST commands
def make_job_graph(infiles, fragfiles, blastcmds):
    """Return a job dependency graph, based on the passed input sequence files.

    - infiles - a list of paths to input FASTA files
    - fragfiles - a list of paths to fragmented input FASTA files

    The executables called are those held in the passed BLASTcmds object, so
    it *is* possible to make a mess by passing executables that do not match
    the mode you're using.

    All items in the returned graph list are BLAST executable jobs that must
    be run *after* the corresponding database creation. The Job objects
    corresponding to the database creation are contained as dependencies.
    How those jobs are scheduled depends on the scheduler (see
    run_multiprocessing.py, run_sge.py).
    """
    joblist = []  # Holds list of job dependency graphs

    # Get dictionary of database-building jobs
    dbjobdict = build_db_jobs(infiles, blastcmds)

    # Create list of BLAST executable jobs, with dependencies
    jobnum = len(dbjobdict)
    for idx, fname1 in enumerate(fragfiles[:-1]):
        for fname2 in fragfiles[idx + 1 :]:
            jobnum += 1
            jobs = [
                pyani_jobs.Job(
                    "%s_exe_%06d_a" % (blastcmds.prefix, jobnum),
                    blastcmds.build_blast_cmd(fname1, fname2.replace("-fragments", "")),
                ),
                pyani_jobs.Job(
                    "%s_exe_%06d_b" % (blastcmds.prefix, jobnum),
                    blastcmds.build_blast_cmd(fname2, fname1.replace("-fragments", "")),
                ),
            ]
            jobs[0].add_dependency(dbjobdict[fname1.replace("-fragments", "")])
            jobs[1].add_dependency(dbjobdict[fname2.replace("-fragments", "")])
            joblist.extend(jobs)

    # Return the dependency graph
    return joblist
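
# Example (sketch): three genomes give three database jobs and six BLAST
# jobs (each unordered pair is searched in both directions); each BLAST Job
# carries a database-construction Job as a dependency:
#
#     blastcmds = make_blastcmd_builder("ANIb", "anib_output")
#     fragfiles, _ = fragment_fasta_files(infiles, "anib_output", 1020)
#     joblist = make_job_graph(infiles, fragfiles, blastcmds)
#     # len(joblist) == 6 when len(infiles) == 3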


# Generate list of makeblastdb command lines from passed filenames
def generate_blastdb_commands(filenames, outdir, blastdb_exe=None, mode="ANIb"):
    """Return a list of makeblastdb command-lines for ANIb/ANIblastall

    - filenames - a list of paths to input FASTA files
    - outdir - path to output directory
    - blastdb_exe - path to the makeblastdb executable
    """
    if mode == "ANIb":
        construct_db_cmdline = construct_makeblastdb_cmd
    else:
        construct_db_cmdline = construct_formatdb_cmd
    if blastdb_exe is None:
        cmdlines = [construct_db_cmdline(fname, outdir) for fname in filenames]
    else:
        cmdlines = [
            construct_db_cmdline(fname, outdir, blastdb_exe) for fname in filenames
        ]
    return cmdlines


# Generate single makeblastdb command line
def construct_makeblastdb_cmd(
    filename, outdir, blastdb_exe=pyani_config.MAKEBLASTDB_DEFAULT
):
    """Returns a single makeblastdb command.

    - filename - input filename
    - blastdb_exe - path to the makeblastdb executable
    """
    title = os.path.splitext(os.path.split(filename)[-1])[0]
    outfilename = os.path.join(outdir, os.path.split(filename)[-1])
    return (
        "{0} -dbtype nucl -in {1} -title {2} -out {3}".format(
            blastdb_exe, filename, title, outfilename
        ),
        outfilename,
    )
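
# Example of the (command, database path) tuple this produces, assuming the
# default makeblastdb executable name (paths are illustrative):
#
#     construct_makeblastdb_cmd("genomes/NC_002696.fna", "anib_output")
#     # -> ("makeblastdb -dbtype nucl -in genomes/NC_002696.fna "
#     #     "-title NC_002696 -out anib_output/NC_002696.fna",
#     #     "anib_output/NC_002696.fna")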


# Generate single formatdb command line
def construct_formatdb_cmd(filename, outdir, blastdb_exe=pyani_config.FORMATDB_DEFAULT):
    """Returns a single formatdb command.

    - filename - input filename
    - blastdb_exe - path to the formatdb executable
    """
    title = os.path.splitext(os.path.split(filename)[-1])[0]
    newfilename = os.path.join(outdir, os.path.split(filename)[-1])
    shutil.copy(filename, newfilename)
    return (
        "{0} -p F -i {1} -t {2}".format(blastdb_exe, newfilename, title),
        newfilename,
    )


# Generate list of BLASTN command lines from passed filenames
def generate_blastn_commands(filenames, outdir, blast_exe=None, mode="ANIb"):
    """Return a list of blastn command-lines for ANIm

    - filenames - a list of paths to fragmented input FASTA files
    - outdir - path to output directory
    - blastn_exe - path to BLASTN executable

    Assumes that the fragment sequence input filenames have the form
    ACCESSION-fragments.ext, where the corresponding BLAST database filenames
    have the form ACCESSION.ext. This is the convention followed by the
    fragment_FASTA_files() function above.
    """
    if mode == "ANIb":
        construct_blast_cmdline = construct_blastn_cmdline
    else:
        construct_blast_cmdline = construct_blastall_cmdline
    cmdlines = []
    for idx, fname1 in enumerate(filenames[:-1]):
        dbname1 = fname1.replace("-fragments", "")
        for fname2 in filenames[idx + 1 :]:
            dbname2 = fname2.replace("-fragments", "")
            if blast_exe is None:
                cmdlines.append(construct_blast_cmdline(fname1, dbname2, outdir))
                cmdlines.append(construct_blast_cmdline(fname2, dbname1, outdir))
            else:
                cmdlines.append(
                    construct_blast_cmdline(fname1, dbname2, outdir, blast_exe)
                )
                cmdlines.append(
                    construct_blast_cmdline(fname2, dbname1, outdir, blast_exe)
                )
    return cmdlines
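
# Example (sketch): two fragment files yield two blastn command lines, one
# per search direction (forward and reciprocal):
#
#     generate_blastn_commands(
#         ["anib_output/NC_002696-fragments.fna",
#          "anib_output/NC_011916-fragments.fna"],
#         "anib_output",
#     )
#     # -> [<NC_002696 fragments vs NC_011916 database>,
#     #     <NC_011916 fragments vs NC_002696 database>]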


# Generate single BLASTN command line
def construct_blastn_cmdline(
    fname1, fname2, outdir, blastn_exe=pyani_config.BLASTN_DEFAULT
):
    """Returns a single blastn command.

    - filename - input filename
    - blastn_exe - path to BLASTN executable
    """
    fstem1 = os.path.splitext(os.path.split(fname1)[-1])[0]
    fstem2 = os.path.splitext(os.path.split(fname2)[-1])[0]
    fstem1 = fstem1.replace("-fragments", "")
    prefix = os.path.join(outdir, "%s_vs_%s" % (fstem1, fstem2))
    cmd = (
        "{0} -out {1}.blast_tab -query {2} -db {3} "
        + "-xdrop_gap_final 150 -dust no -evalue 1e-15 "
        + "-max_target_seqs 1 -outfmt '6 qseqid sseqid length mismatch "
        + "pident nident qlen slen qstart qend sstart send positive "
        + "ppos gaps' -task blastn"
    )
    return cmd.format(blastn_exe, prefix, fname1, fname2)
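
# The returned command writes tabular output to <query>_vs_<subject>.blast_tab
# in outdir, e.g. (illustrative paths, default blastn executable, wrapped here
# for readability):
#
#     blastn -out anib_output/NC_002696_vs_NC_011916.blast_tab
#         -query anib_output/NC_002696-fragments.fna
#         -db anib_output/NC_011916.fna
#         -xdrop_gap_final 150 -dust no -evalue 1e-15 -max_target_seqs 1
#         -outfmt '6 qseqid sseqid length mismatch pident nident qlen slen
#         qstart qend sstart send positive ppos gaps' -task blastn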


# Generate single BLASTALL command line
def construct_blastall_cmdline(
    fname1, fname2, outdir, blastall_exe=pyani_config.BLASTALL_DEFAULT
):
    """Returns a single blastall command.

    - blastall_exe - path to BLASTALL executable
    """
    fstem1 = os.path.splitext(os.path.split(fname1)[-1])[0]
    fstem2 = os.path.splitext(os.path.split(fname2)[-1])[0]
    fstem1 = fstem1.replace("-fragments", "")
    prefix = os.path.join(outdir, "%s_vs_%s" % (fstem1, fstem2))
    cmd = (
        "{0} -p blastn -o {1}.blast_tab -i {2} -d {3} "
        + "-X 150 -q -1 -F F -e 1e-15 "
        + "-b 1 -v 1 -m 8"
    )
    return cmd.format(blastall_exe, prefix, fname1, fname2)


# Process pairwise BLASTN/BLASTALL output
def process_blast(
    blast_dir,
    org_lengths,
    fraglengths=None,
    mode="ANIb",
    identity=0.3,
    coverage=0.7,
    logger=None,
):
    """Returns a tuple of ANIb results for .blast_tab files in the output dir.

    - blast_dir - path to the directory containing .blast_tab files
    - org_lengths - the base count for each input sequence
    - fraglengths - dictionary of query sequence fragment lengths, only
      needed for BLASTALL output
    - mode - parsing BLASTN+ ("ANIb") or BLASTALL ("ANIblastall") output?
    - identity - minimum fractional identity for a fragment match to be retained
    - coverage - minimum fractional query coverage for a fragment match to be
      retained
    - logger - a logger for messages

    Returns the following pandas dataframes in an ANIResults object;
    query sequences are rows, subject sequences are columns:

    - alignment_lengths - non-symmetrical: total length of alignment
    - percentage_identity - non-symmetrical: ANIb (Goris) percentage identity
    - alignment_coverage - non-symmetrical: coverage of query
    - similarity_errors - non-symmetrical: count of similarity errors

    May throw a ZeroDivisionError if one or more BLAST runs failed, or a
    very distant sequence was included in the analysis.
    """
    # Process directory to identify input files
    blastfiles = pyani_files.get_input_files(blast_dir, ".blast_tab")
    # Hold data in ANIResults object
    results = ANIResults(list(org_lengths.keys()), mode)

    # Fill diagonal NA values for alignment_length with org_lengths
    for org, length in list(org_lengths.items()):
        results.alignment_lengths[org][org] = length

    # Process .blast_tab files assuming that the filename format holds:
    # org1_vs_org2.blast_tab:
    for blastfile in blastfiles:
        qname, sname = os.path.splitext(os.path.split(blastfile)[-1])[0].split("_vs_")

        # We may have BLAST files from other analyses in the same directory
        # If this occurs, we raise a warning, and skip the file
        if qname not in list(org_lengths.keys()):
            if logger:
                logger.warning(
                    "Query name %s not in input " % qname
                    + "sequence list, skipping %s" % blastfile
                )
            continue
        if sname not in list(org_lengths.keys()):
            if logger:
                logger.warning(
                    "Subject name %s not in input " % sname
                    + "sequence list, skipping %s" % blastfile
                )
            continue
        resultvals = parse_blast_tab(blastfile, fraglengths, identity, coverage, mode)
        query_cover = float(resultvals[0]) / org_lengths[qname]

        # Populate dataframes: when assigning data, we need to note that
        # we have asymmetrical data from BLAST output, so only the
        # upper triangle is populated
        results.add_tot_length(qname, sname, resultvals[0], sym=False)
        results.add_sim_errors(qname, sname, resultvals[1], sym=False)
        results.add_pid(qname, sname, 0.01 * resultvals[2], sym=False)
        results.add_coverage(qname, sname, query_cover)
    return results
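
# Example (sketch): summarising a completed ANIb run. org_lengths maps each
# input genome name to its total length in bases (values here hypothetical):
#
#     org_lengths = {"NC_002696": 4016947, "NC_011916": 4042929}
#     results = process_blast("anib_output", org_lengths, mode="ANIb")
#     results.percentage_identity  # DataFrame of pairwise ANIb values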


# Parse BLASTN/BLASTALL output to get total alignment length and mismatches
def parse_blast_tab(filename, fraglengths, identity, coverage, mode="ANIb"):
    """Returns (alignment length, similarity errors, mean_pid) tuple
    from .blast_tab

    - filename - path to .blast_tab file

    Calculate the alignment length and total number of similarity errors (as
    we would with ANIm), as well as the Goris et al.-defined mean identity
    of all valid BLAST matches for the passed BLASTALL alignment .blast_tab
    file.

    '''ANI between the query genome and the reference genome was calculated as
    the mean identity of all BLASTN matches that showed more than 30% overall
    sequence identity (recalculated to an identity along the entire sequence)
    over an alignable region of at least 70% of their length.
    '''
    """
    # Assuming that the filename format holds org1_vs_org2.blast_tab:
    qname = os.path.splitext(os.path.split(filename)[-1])[0].split("_vs_")[0]
    # Load output as dataframe
    if mode == "ANIblastall":
        qfraglengths = fraglengths[qname]
        columns = [
            "sid",
            "blast_pid",
            "blast_alnlen",
            "blast_mismatch",
            "blast_gaps",
            "q_start",
            "q_end",
            "s_start",
            "s_end",
            "e_Value",
            "bit_score",
        ]
    else:
        columns = [
            "sbjct_id",
            "blast_alnlen",
            "blast_mismatch",
            "blast_pid",
            "blast_identities",
            "qlen",
            "slen",
            "q_start",
            "q_end",
            "s_start",
            "s_end",
            "blast_pos",
            "ppos",
            "blast_gaps",
        ]

    # We may receive an empty BLASTN output file, if there are no significant
    # regions of homology. This causes pandas to throw an error on CSV import.
    # To get past this, we create an empty dataframe with the appropriate
    # columns.
    try:
        data = pd.read_csv(filename, header=None, sep="\t", index_col=0)
        data.columns = columns
    except pd.errors.EmptyDataError:
        data = pd.DataFrame(columns=columns)
    # Add new column for fragment length, only for BLASTALL
    if mode == "ANIblastall":
        data["qlen"] = pd.Series(
            [qfraglengths[idx] for idx in data.index], index=data.index
        )

    # Add new columns for recalculated alignment length, proportion, and
    # percentage identity
    data["ani_alnlen"] = data["blast_alnlen"] - data["blast_gaps"]
    data["ani_alnids"] = data["ani_alnlen"] - data["blast_mismatch"]
    data["ani_coverage"] = data["ani_alnlen"] / data["qlen"]
    data["ani_pid"] = data["ani_alnids"] / data["qlen"]

    # Filter rows on 'ani_coverage' > coverage and 'ani_pid' > identity
    filtered = data[(data["ani_coverage"] > coverage) & (data["ani_pid"] > identity)]
    # Dedupe query hits, so we only take the best hit
    filtered = filtered.groupby(filtered.index).first()
    # Replace NaNs with zero
    filtered = filtered.fillna(value=0)  # Needed if no matches

    # The ANI value is then the mean percentage identity.
    # We report total alignment length and the number of similarity errors
    # (mismatches and gaps), as for ANIm
    # NOTE: We report the mean of 'blast_pid' for concordance with JSpecies
    # Despite this, the concordance is not exact. Manual inspection during
    # development indicated that a handful of fragments are differentially
    # filtered out in JSpecies and this script. This is often on the basis
    # of rounding differences (e.g. coverage being close to 70%).
    # NOTE: If there are no hits, then ani_pid will be nan - we replace this
    # with zero if that happens
    ani_pid = filtered["blast_pid"].mean()
    if pd.isnull(ani_pid):  # Happens if there are no matches in ANIb
        ani_pid = 0
    aln_length = filtered["ani_alnlen"].sum()
    sim_errors = filtered["blast_mismatch"].sum() + filtered["blast_gaps"].sum()
    filtered.to_csv(filename + ".dataframe", sep="\t")
    return aln_length, sim_errors, ani_pid
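
# Worked example of the per-fragment recalculation above (numbers are
# illustrative): a 1020 nt query fragment whose best BLAST match spans
# 980 aligned columns with 20 gaps and 60 mismatches gives
#
#     ani_alnlen   = 980 - 20   = 960
#     ani_alnids   = 960 - 60   = 900
#     ani_coverage = 960 / 1020 = 0.94  (> 0.7, so the fragment passes)
#     ani_pid      = 900 / 1020 = 0.88  (> 0.3, so the fragment passes)
#
# The fragment's blast_pid then contributes to the mean identity (the ANIb
# value), and its ani_alnlen, mismatches and gaps to the reported totals.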