.. _`chapter:introduction`:

Introduction
============

What is Biopython?
------------------

Biopython is a collection of freely available Python
(https://www.python.org) modules for computational molecular biology.
Python is an object oriented, interpreted, flexible language that is
widely used for scientific computing. Python is easy to learn, has a
very clear syntax and can easily be extended with modules written in C,
C++ or FORTRAN. Since its inception in 2000 [Chapman2000]_, Biopython
has been continuously developed and maintained by a large group of
volunteers worldwide.

The Biopython web site (http://www.biopython.org) provides an online
resource for modules, scripts, and web links for developers of
Python-based software for bioinformatics use and research. Biopython
includes parsers for various bioinformatics file formats (BLAST,
Clustalw, FASTA, Genbank, ...), access to online services (NCBI,
Expasy, ...), a standard sequence class, sequence alignment and motif
analysis tools, clustering algorithms, a module for structural biology,
and a module for phylogenetics analysis.

What can I find in the Biopython package?
-----------------------------------------

The main Biopython releases have lots of functionality, including:

-  The ability to parse bioinformatics files into data structures that
   Python can use, including support for the following formats:

   -  Blast output – both from standalone and WWW Blast

   -  Clustalw

   -  FASTA

   -  GenBank

   -  PubMed and Medline

   -  ExPASy files, like Enzyme and Prosite

   -  SCOP, including ‘dom’ and ‘lin’ files

   -  UniGene

   -  SwissProt

-  Files in the supported formats can be iterated over record by record
   or indexed and accessed via a Dictionary interface.

-  Code to deal with popular on-line bioinformatics destinations such
   as:

   -  NCBI – Blast, Entrez and PubMed services

   -  ExPASy – Swiss-Prot and Prosite entries, as well as Prosite
      searches

-  Interfaces to common bioinformatics programs such as:

   -  Standalone Blast from NCBI

   -  Clustalw alignment program

   -  EMBOSS command line tools

-  A standard sequence class that deals with sequences, ids on
   sequences, and sequence features.

-  Tools for performing common operations on sequences, such as
   translation, transcription and weight calculations.

-  Code to perform classification of data using k Nearest Neighbors,
   Naive Bayes or Support Vector Machines.

-  Code for dealing with alignments, including a standard way to create
   and deal with substitution matrices.

-  Code making it easy to split up parallelizable tasks into separate
   processes.

-  GUI-based programs to do basic sequence manipulations, translations,
   BLASTing, etc.

-  Extensive documentation and help with using the modules, including
   this file, on-line wiki documentation, the web site, and the mailing
   list.

-  Integration with BioSQL, a sequence database schema also supported by
   the BioPerl and BioJava projects.

We hope this gives you plenty of reasons to download and start using
Biopython!
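
As a small taste of the parsing and sequence tools listed above, here is
a minimal sketch, assuming Biopython is installed. The two FASTA records
are made up for illustration and are held in memory so that no input
file is needed:

.. code:: python

   from io import StringIO

   from Bio import SeqIO

   # Two made-up FASTA records (hypothetical data, not from any database).
   fasta_text = """>example1 hypothetical record
   ATGGCCATTGTAATGGGCCGCTGA
   >example2 hypothetical record
   ATGAAATAG
   """

   # Iterate over the records one at a time.
   for record in SeqIO.parse(StringIO(fasta_text), "fasta"):
       print(record.id, len(record.seq))

   # Or load them all into a dictionary keyed by record identifier.
   records = SeqIO.to_dict(SeqIO.parse(StringIO(fasta_text), "fasta"))

   # Translate one sequence, stopping at the first stop codon.
   protein = records["example1"].seq.translate(to_stop=True)
   print(protein)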

Installing Biopython
--------------------

All of the installation information for Biopython was separated from
this document to make it easier to keep updated.

The short version is to use ``pip install biopython``; see the `main
README <https://github.com/biopython/biopython/blob/master/README.rst>`__
file for other options.

Frequently Asked Questions (FAQ)
--------------------------------

#. | *How do I cite Biopython in a scientific publication?*
   | Please cite our application note [Cock2009]_ as the
     main Biopython reference. In addition, please cite any publications
     from the following list if appropriate, in particular as a
     reference for specific modules within Biopython (more information
     can be found on our website):

   -  For the official project announcement:
      Chapman and Chang, 2000 [Chapman2000]_;

   -  For ``Bio.PDB``:
      Hamelryck and Manderick, 2003 [Hamelryck2003A]_;

   -  For ``Bio.Cluster``:
      De Hoon *et al.*, 2004 [DeHoon2004]_;

   -  For ``Bio.Graphics.GenomeDiagram``:
      Pritchard *et al.*, 2006 [Pritchard2006]_;

   -  For ``Bio.Phylo`` and ``Bio.Phylo.PAML``:
      Talevich *et al.* 2012 [Talevich2012]_;

   -  For the FASTQ file format as supported in Biopython, BioPerl,
      BioRuby, BioJava, and EMBOSS:
      Cock *et al.*, 2010 [Cock2010]_.

#. | *How should I capitalize “Biopython”? Is “BioPython” OK?*
   | The correct capitalization is “Biopython”, not “BioPython” (even
     though that would have matched BioPerl, BioJava and BioRuby).

#. | *How is the Biopython software licensed?*
   | Biopython is distributed under the *Biopython License Agreement*.
     However, since the release of Biopython 1.69, some files are
     explicitly dual licensed under your choice of the *Biopython
     License Agreement* or the *BSD 3-Clause License*. This is with the
     intention of later offering all of Biopython under this dual
     licensing approach.

#. | *What is the Biopython logo and how is it licensed?*
   | As of July 2017 and the Biopython 1.70 release, the Biopython logo
     is a yellow and blue snake forming a double helix above the word
     “biopython” in lower case. It was designed by Patrick Kunzmann and
     this logo is dual licensed under your choice of the *Biopython
     License Agreement* or the *BSD 3-Clause License*.
   | |new-logo|
   | Prior to this, the Biopython logo was two yellow snakes forming a
     double helix around the word “BIOPYTHON”, designed by Henrik
     Vestergaard and Thomas Hamelryck in 2003 as part of an open
     competition.
   | |old-logo|

#. | *Do you have a change-log listing what’s new in each release?*
   | See the file ``NEWS.rst`` included with the source code (originally
     called just ``NEWS``), or read the `latest NEWS file on
     GitHub <https://github.com/biopython/biopython/blob/master/NEWS.rst>`__.

#. | *What is going wrong with my print commands?*
   | As of Biopython 1.77, we only support Python 3, so this tutorial
     uses the Python 3 style print *function*.

#. | *How do I find out what version of Biopython I have installed?*
   | Use this:

   .. code:: pycon

      >>> import Bio
      >>> print(Bio.__version__)

   Note that there are double underscores before and after ``version``.
   If the “``import Bio``” line fails, Biopython is not installed. If
   the second line fails, your version is *very* out of date.

   If the version string ends with a plus like “``1.66+``”, you don’t
   have an official release, but an old snapshot of the in-development
   code *after* that version was released. This naming was used until
   June 2016 in the run-up to Biopython 1.68.

   If the version string ends with “``.dev<number>``” like
   “``1.68.dev0``”, again you don’t have an official release, but
   instead a snapshot of the in-development code *before* that version
   was released.
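
   The naming rules above can be sketched as a small classifier. This is
   only an illustration: ``describe_version`` is a hypothetical helper,
   not part of Biopython, and it handles only the three forms described
   here.

   .. code:: python

      import re


      def describe_version(version):
          """Classify a Biopython-style version string (hypothetical helper)."""
          if version.endswith("+"):
              # Old naming scheme: snapshot taken after the named release.
              return "snapshot after " + version[:-1]
          match = re.fullmatch(r"(.+)\.dev\d+", version)
          if match:
              # Current naming scheme: snapshot taken before the named release.
              return "snapshot before " + match.group(1)
          return "official release " + version


      for example in ["1.85", "1.66+", "1.68.dev0"]:
          print(example, "->", describe_version(example))

   In practice you would pass ``Bio.__version__`` to such a function.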

#. | *Where is the latest version of this document?*
   | If you download a Biopython source code archive, it will include
     the relevant version in both HTML and PDF formats. The latest
     published version of this document (updated at each release) is
     online:

   -  http://biopython.org/DIST/docs/tutorial/Tutorial.html

   -  http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

#. | *What is wrong with my sequence comparisons?*
   | Biopython 1.65 introduced a major change: the ``Seq`` and
     ``MutableSeq`` classes (and subclasses) now use simple string-based
     comparison, which you can do explicitly with
     ``str(seq1) == str(seq2)``.

   Older versions of Biopython used instance-based comparison for
   ``Seq`` objects, which you can do explicitly with
   ``id(seq1) == id(seq2)``.

   If you still need to support old versions of Biopython, use these
   explicit forms to avoid problems. See
   Section :ref:`sec:seq-comparison`.
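
   A minimal sketch of the two explicit forms, assuming Biopython is
   installed (the sequences are arbitrary examples):

   .. code:: python

      from Bio.Seq import Seq

      seq1 = Seq("ACGT")
      seq2 = Seq("ACGT")

      # Explicit string-based comparison: True when the letters match.
      same_letters = str(seq1) == str(seq2)

      # Explicit instance-based comparison: True only for the same object.
      same_object = id(seq1) == id(seq2)

      print(same_letters, same_object)

   Because both forms are spelled out, the result does not depend on how
   the installed version of Biopython defines ``==`` for ``Seq`` objects.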

#. | *What file formats do* ``Bio.SeqIO`` *and* ``Bio.AlignIO`` *read
     and write?*
   | Check the built-in docstrings (``from Bio import SeqIO``, then
     ``help(SeqIO)``), or see http://biopython.org/wiki/SeqIO and
     http://biopython.org/wiki/AlignIO on the wiki for the latest
     listing.

#. | *Why won’t the* ``Bio.SeqIO`` *and* ``Bio.AlignIO`` *functions*
     ``parse``\ *,* ``read`` *and* ``write`` *take filenames? They
     insist on handles!*
   | You need Biopython 1.54 or later, or just use handles explicitly
     (see Section :ref:`sec:appendix-handles`).
     It is especially important to remember to close output handles
     explicitly after writing your data.
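
   A minimal sketch of the explicit-handle approach, assuming Biopython
   is installed. The record and the file name are made up for
   illustration. Each ``with`` block closes its handle when the block
   ends, so the output is flushed to disk before it is read back:

   .. code:: python

      import os
      import tempfile

      from Bio import SeqIO
      from Bio.Seq import Seq
      from Bio.SeqRecord import SeqRecord

      # A made-up record to write out (hypothetical data).
      record = SeqRecord(
          Seq("ACGTACGT"), id="example1", description="hypothetical record"
      )

      # Write to a scratch directory so the sketch leaves no stray files behind.
      path = os.path.join(tempfile.mkdtemp(), "example.fasta")

      # The output handle is closed automatically at the end of the block.
      # Passing a one-element list also works with older Biopython versions.
      with open(path, "w") as output_handle:
          count = SeqIO.write([record], output_handle, "fasta")

      # Read the file back, again through an explicit handle.
      with open(path) as input_handle:
          for parsed in SeqIO.parse(input_handle, "fasta"):
              print(parsed.id, parsed.seq)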

#. | *Why won’t the* ``Bio.SeqIO.write()`` *and* ``Bio.AlignIO.write()``
     *functions accept a single record or alignment? They insist on a
     list or iterator!*
   | You need Biopython 1.54 or later, or just wrap the item with
     ``[...]`` to create a list of one element.

#. | *Why doesn’t* ``str(...)`` *give me the full sequence of a* ``Seq``
     *object?*
   | You need Biopython 1.45 or later.

#. | *Why doesn’t* ``Bio.Blast`` *work with the latest plain text NCBI
     blast output?*
   | The NCBI keep tweaking the plain text output from the BLAST tools,
     and keeping our parser up to date is/was an ongoing struggle. If
     you aren’t using the latest version of Biopython, you could try
     upgrading. However, we (and the NCBI) recommend you use the XML
     output instead, which is designed to be read by a computer program.

#. | *Why has my script using* ``Bio.Entrez.efetch()`` *stopped
     working?*
   | This could be due to NCBI changes in February 2012 introducing
     EFetch 2.0. First, they changed the default return modes, so you
     probably want to add ``retmode="text"`` to your call. Second, they
     are now stricter about how to provide a list of IDs; Biopython
     1.59 onwards turns a list into a comma-separated string
     automatically.
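
   Both points can be made explicit in your own code. The sketch below
   only builds the arguments and does not contact the NCBI;
   ``build_efetch_arguments`` is a hypothetical helper, and the database,
   return type and identifiers are illustrative assumptions.

   .. code:: python

      def build_efetch_arguments(ids, db="nucleotide", rettype="fasta"):
          """Return keyword arguments for Bio.Entrez.efetch (hypothetical helper)."""
          return {
              "db": db,
              # Join the IDs into a single comma-separated string ourselves.
              "id": ",".join(str(identifier) for identifier in ids),
              "rettype": rettype,
              # Ask for plain text explicitly rather than relying on the default.
              "retmode": "text",
          }


      arguments = build_efetch_arguments(["EU490707", "EU490708"])
      print(arguments)

      # With Biopython installed and network access, the call would look like:
      #     from Bio import Entrez
      #     Entrez.email = "your.name@example.org"  # placeholder address
      #     handle = Entrez.efetch(**arguments)
      #     text = handle.read()
      #     handle.close()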

#. | *Why doesn’t* ``Bio.Blast.NCBIWWW.qblast()`` *give the same results
     as the NCBI BLAST website?*
   | You need to specify the same options – the NCBI often adjust the
     default settings on the website, and they do not match the QBLAST
     defaults anymore. Check things like the gap penalties and
     expectation threshold.

#. | *Why can’t I add* ``SeqRecord`` *objects together?*
   | You need Biopython 1.53 or later.
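
   A minimal sketch of record addition, assuming a sufficiently recent
   Biopython is installed (the records are made up):

   .. code:: python

      from Bio.Seq import Seq
      from Bio.SeqRecord import SeqRecord

      left = SeqRecord(Seq("ACGT"), id="left")
      right = SeqRecord(Seq("GGCC"), id="right")

      # Adding two records gives a new record with the sequences joined.
      combined = left + right
      print(combined.seq)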

#. | *Why doesn’t* ``Bio.SeqIO.index_db()`` *work? The module imports
     fine but there is no* ``index_db`` *function!*
   | You need Biopython 1.57 or later (and a Python with SQLite3
     support).

#. | *Where is the* ``MultipleSeqAlignment`` *object? The* ``Bio.Align``
     *module imports fine but this class isn’t there!*
   | You need Biopython 1.54 or later. Alternatively, the older
     ``Bio.Align.Generic.Alignment`` class supports some of its
     functionality, but using this is now discouraged.

#. | *Why can’t I run command line tools directly from the application
     wrappers?*
   | You need Biopython 1.55 or later, but the application wrappers
     were deprecated in Biopython 1.78. Consider using the Python
     ``subprocess`` module directly.
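
   A minimal sketch of calling a command line tool with ``subprocess``.
   So that it runs anywhere, the command here is the Python interpreter
   itself, as a stand-in; in practice the list would hold the name of the
   external tool and its options.

   .. code:: python

      import subprocess
      import sys

      # Stand-in command (an assumption for illustration): replace this list
      # with the executable name and arguments of the tool you want to run.
      command = [sys.executable, "-c", "print('hello from a child process')"]

      # check=True raises CalledProcessError if the tool exits with an error.
      completed = subprocess.run(
          command, capture_output=True, text=True, check=True
      )
      print(completed.stdout.strip())

   Passing the command as a list avoids shell quoting problems with file
   names that contain spaces.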

#. | *I looked in a directory for code, but I couldn’t find the code
     that does something. Where’s it hidden?*
   | We put code in ``__init__.py`` files, which can be confusing if
     you are not used to looking for code there. We do this to make
     imports easier for users. For instance, instead of having to do a
     “repetitive” import like ``from Bio.GenBank import GenBank``, you
     can just use ``from Bio import GenBank``.

#. | *Why doesn’t* ``Bio.Fasta`` *work?*
   | We deprecated the ``Bio.Fasta`` module in Biopython 1.51 (August
     2009) and removed it in Biopython 1.55 (August 2010). There is a
     brief example showing how to convert old code to use ``Bio.SeqIO``
     instead in the
     `DEPRECATED.rst <https://github.com/biopython/biopython/blob/master/DEPRECATED.rst>`__
     file.

For more general questions, the Python FAQ pages
(https://docs.python.org/3/faq/index.html) may be useful.

.. |new-logo| image:: ../images/biopython_logo_m.png
   :width: 6cm
.. |old-logo| image:: ../images/biopython_logo_old.jpg
   :width: 7cm