File: README

package info (click to toggle)
uc-echo 1.12-19
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 1,904 kB
  • sloc: cpp: 1,405; python: 659; makefile: 44; sh: 8
file content (169 lines) | stat: -rw-r--r-- 6,213 bytes parent folder | download | duplicates (3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
ECHO: Short-read Error Correction
v1.12

Prerequisites:
-------------
1. GNU GCC (version >= 4.1)
    http://gcc.gnu.org/

2. Python (version >= 2.6)
   (64-bit Python is needed for large data sets, e.g. >= 10M reads)
    http://www.python.org/

3. SciPy (version >= 0.7)
    http://www.scipy.org/

4. NumPy (version >= 1.3)
    http://numpy.scipy.org/

License:
-------
ECHO is released under the BSD License. See the LICENSE file for more details.

Quickstart:
----------
1. In the directory containing the source files, type
    > make

2. In the same directory, type:
    > ErrorCorrection.py -o output/sample_data.fastq sample_data.txt
    
    - sample_data.txt is the input file.
    - output/sample_data.fastq is the output file name.

3. The output files will be in the directory called "output".
   There will be 3 files:
    a. sample_data.fastq
        The corrected reads in FASTQ format.
    b. sample_data.fastq.seq
        Only the reads, without the quality scores.
    c. sample_data.fastq.qual
        Only the quality scores.

4. The output file sample_data.fastq.seq can be compared to sample_answer.txt, which contains the original reads without errors. There will be differences between sample_data.fastq.seq and sample_answer.txt because not every error is corrected by the error correction.

5. To run with 4 threads, type
    > ErrorCorrection.py --ncpu 4 -o output/sample_data.fastq sample_data.txt

Installation:
------------
In the directory containing the source files, type
    > make

For ECHO to allocate more than 4GB of memory, it must be compiled with the -m64 option (default). This requires a 64-bit machine.

Usage:
-----
ECHO accepts 4 input formats:
    1) Plain text format
    2) Plain text format gzipped
    3) FASTQ format
    3) FASTQ format gzipped

Plain text format is a file with one line per read. Note that this is *not* the same as the FASTA format. The Plain text format does *not* allow reads to span across multiple lines, and it does not have "comment" lines. The read should be oriented so that bases sequenced first are at the beginning of the line.

Also note that FASTA format is *not* supported by  ECHO at the moment.

Either the plain text format or the FASTQ format may be gzipped, and used as input to ECHO.

Note that the parsing method used by ECHO is selected depending on the *extension* of the input file:
    .gz is parsed as a gzipped Plain text format file.
    .fastq is parsed as a FASTQ file.
    .fastq.gz is parsed as a gzipped FASTQ file.

The output file format of ECHO is standard FASTQ format.

To run ECHO, use the following command
> ErrorCorrection.py -o output.fastq input.txt

NOTE: The above command will generally work on relatively small data set sizes. The exact size depends on the amount of avaiable RAM. Please read the below 'Notes on Memory Usage' for additional details on running ECHO on larger data sets.

Help messages for additional options can be found by using
> ErrorCorrection.py -h

Notes on Memory Usage:
---------------------
It is very important that ECHO does not saturate all available memory. If this happens, ECHO will begin to swap with the hard disk, which will cause it to run very slowly, much more slowly than it would otherwise if certain parameters were set correctly.

There are two parameters that should be tweaked:
a) -b denotes the number of reads per read block.

b) --nh will divide the kmer space into nh blocks.
    The larger the data set, the higher nh needs to be.
    Exactly what this value needs to be depends on the available RAM.

These parameters are set on the command line.

Examples:
For 5M reads, the following will typically run well with 8GB memory:
> ErrorCorrection.py -b 2000000 --nh 256 -o output/5Mreads.fastq 5Mreads.txt

For 20M reads, the following will typically run well with 8GB memory:
> ErrorCorrection.py -b 2000000 --nh 1024 -o output/20Mreads.fastq 20Mreads.txt

Using Multiple Threads: 
----------------------
Use the --ncpu command line option to select the number of threads to use during the error correction. Note that because the main bottleneck is due to memory constraints, only a few stages of the error correction process use more than one thread. 

Example:
To use 6 threads at once:
> ErrorCorrection.py -b 2000000 --nh 256 --ncpu 6 -o output/5Mreads.fastq 5Mreads.txt

Log files:
---------
Every execution of ECHO produces a log file. The default location is the log/ directory. The log file records ECHO's progress during the run.

History of ECHO versions:
------------------------

Changes in 1.12:
----------------
- Fixed bug regarding custom temporary directory names.

Changes in 1.11:
----------------
- Changed struct type for writing out reads to unsigned long long from unsigned long in ErrorCorrection.py. This was causing some compatibility issues.

Changes in 1.10:
----------------
- Fixed a bug where temporary files were created in /tmp, rather than ./tmp.

Changes in 1.09:
----------------
- Removed dependence on C++0x features. Now using several functions from TR1, which has been included since at least gcc 4.1.

Changes in 1.08:
----------------
- Changed which kmer bins are considered for constructing the read overlap graph. Kmer bins with too many reads are not ignored to improve running times.

Changes in 1.07a:
----------------
- Changed default max_paramh to 50 from 24. 

Changes in 1.06a:
----------------
- Changed voting procedure and quality score computation.

--------------------------------------------------------------
Changes from 1.04 and 1.05 are not applicable to this version.
--------------------------------------------------------------

Changes in 1.03:
----------------
- Changed data type for read indexing in the reverse complement file to accommodate for large data sets (>30M reads).

Changes in 1.02:
----------------
- Added support for FASTQ and gzipped FASTQ input files.

Changes in 1.01:
----------------
- Added support for gzipped input files.
- Fixed a bug in the error message when checking for the existence of the input file.

Bug Reports:
-----------
Wei-Chun Kao <wckao@EECS.Berkeley.EDU>
Andrew H. Chan <andrewhc@EECS.Berkeley.EDU>
Department of EECS
U.C. Berkeley