File: README.md

package info (click to toggle)
hilive 2.0a-5
  • links: PTS, VCS
  • area: main
  • in suites: experimental
  • size: 956 kB
  • sloc: cpp: 4,723; python: 241; xml: 188; sh: 35; makefile: 14
file content (449 lines) | stat: -rw-r--r-- 19,860 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
HiLive2 - Live Mapping of Illumina reads (Tutorial)
===================================================

Content
-------

 - [Introduction](#introduction)
  - [Installation](#installation)
  - [Example data](#example-data)
 - [Get started](#get-started)
  - [Index building](#index-building)
  - [Run HiLive2](#run-hilive2)
 - [HiLive2 options](#hilive2-options)
  - [Important general options](#important-general-options)
  - [Paired-end sequencing](#paired-end-sequencing) 
  - [Live demultiplexing](#live-demultiplexing)
  - [Output cycles](#output-cycles)
  - [Countinue an interrupted run](#continue-an-interrupted-run)
 - [Using a configuration file](#using-a-configuration-file)
 - [Using the hilive-out executable](#using-the-hilive-out-executable)
 - [Additional remarks](#additional-remarks)


Introduction
------------

HiLive2 is a real-time read mapping software. Input data are directly obtained from an Illumina sequencing machine. Thus, your machine must have access to the output directory of the sequencer.  
We provide several scripts for creating test data for HiLive2, e.g. `fastq2bcl.py` to convert Illumina reads in FASTQ format to the Illumina BaseCalls format. Please find these scripts on the [HiLive2 repository](https://gitlab.com/lokat/hilive2/scripts).  

### Installation
Please make sure you installed HiLive2 as described in the README.md of the program.  
To test HiLive2, please try to open the help of the three executables:

```
> /path/to/hilive -h
> /path/to/hilive-out -h
> /path/to/hilive-build -h
```

### Example data
In the provided sample data, you find the genome of Human immunodeficiency virus 1 (HIV1) in FASTA format. The genome was obtained from NCBI under accession number NC_001802.1 (https://www.ncbi.nlm.nih.gov/nuccore/NC_001802.1).  
From this genome, we simulated 100 paired-end reads of length 2x100bp with the mason read simulator (Holtgrewe, 2010). We manually added barcode sequences of length 8bp between both reads and converted them to Illumina BaseCalls format using our `fastq2bcl.py` script.

Get started
-----------

At the beginning of the tutorial, please change your working directory to the tutorial data:

```
> cd /path/to/HiLive2/tutorial
> ls
```

The `ls` command should list four files/directories:

```
BaseCalls
config.ini
hiv1.fa
README.md
```

### Index building
To use HiLive2, you first need to build an index of your reference genome(s) of interest. In our example, this is the reference genome of HIV1.  
Use the `hilive-build` executable to build your index. Please note, that the output directory must exist:

```
> mkdir index
> hilive-build -i hiv1.fa -o index/hiv1
```

This should only take a few seconds and the command line output should look like this:

```
________________________________________________________________________________

HiLive Index Builder v2.0 - Build Index for Realtime Alignment of Illumina Reads
________________________________________________________________________________
                                                                                
Create FM-index from file hiv1.fa ...
Index was successfully built to file(s) index/hiv1                              
```

### Run HiLive2
After [buiding your index](#index-building), you can run HiLive2 with the prepared input data. The data are in Illumina BaseCall format and placed in the `BaseCalls` directory. The structure of the data in our example is the same as for real data. Depending on the actual sequencing machine (HiSeq, MiSeq, etc.), the data is split into different lanes and tiles. In our example, there is only one lane (L001) and one tile (1101).  
To run HiLive2 with default parameters, type:

```
> hilive -b BaseCalls -i index/hiv1 -r 100R --lanes 1 --tiles 1101 
```

The command line output should look similar to this:

```
__________________________________________________

HiLive v2.0 - Realtime Alignment of Illumina Reads
__________________________________________________

Running HiLive with       1 thread(s).
BaseCalls directory:      BaseCalls
Temporary directory:      ./temp
BAM output directory:     ./out
Lanes:                    1
K-mer index:              index/hiv1
Read lengths:             100R
Min. alignment score:     -24
Mapping mode:             ANYBEST
Anchor length:            11

Loading Index ...
Waiting for the first cycle to finish...
Initializing Alignment files...
First cycle complete. Starting alignment.
Creating 1 threads.
Task [Lane 1 Tile 1101 Cycle 1.1]: Found 0 seeds.
Task [Lane 1 Tile 1101 Cycle 1.2]: Found 0 seeds.
Task [Lane 1 Tile 1101 Cycle 1.3]: Found 0 seeds.

[...]

Task [Lane 1 Tile 1101 Cycle 1.100]: Found 103 seeds.
Finished all alignments.
Waiting for output tasks...
Finished output tasks.
All threads joined.
Total run time: 31 s
``` 

HiLive2 created two new directories: `out` and `temp`. In the `temp` directory, you find temporary alignment files of the last cycle (100). Additionally, there is a file `hilive_config.ini` which stores the parameter settings used by HiLive2 for the given run. The `out` directory contains the actual output files in BAM format. There is one cycle for each cycle and barcode. For the example command with default parameters, this is only one single file `hilive_out_cycle100_undetermined.bam`.


HiLive2 options
---------------

### Important general options:

This sections lists the options of HiLive2 that we expect to be most important when using HiLive2.  
This short introductions does not contain information about options concerning [paired-end sequencing](#paired-end-sequencing), [live demultiplexing](#live-demultiplexing) and [output cycles](#output-cycles) as they are described in separated section to provide more details.  

##### Alignment mode
HiLive2 comes with three different alignment modes: `FAST`, `BALANCED` or `ACCURATE`.  
By default, HiLive2 is started in `BALANCED` mode. Use the `--align-mode` option to make HiLive2 run faster (`FAST`) or more sensitive (`ACCURATE`):

```
> hilive -b BaseCalls -i index/hiv1 --lanes 1 --tiles 1101 -r 100R --align-mode FAST
```

The selected alignment mode can affect several algorithmic parameters, as the anchor length or seeding interval.

##### Report mode
HiLive2 comes with several report modes:

 * `ANYBEST` reports one best alignment  
 * `ALLBEST` reports all best alignments
 * `BESTN#` reports the best up to # alignments (also suboptimal)
 * `ALL` reports all alignments found by the algorithm
 * `UNIQUE` reports only unique alignments (only aligning to one position)

Use the `--out-mode` option to select the respective report mode:

```
> hilive -b BaseCalls -i index/hiv1 --lanes 1 --tiles 1101 -r 100R --out-mode BESTN5
```

##### Lanes and tiles
Different Illumina sequencing machines (and run modes/protocols) have a different number of lanes and tiles.  
These can be set with the `--lanes` and `--tiles` option as a comma-separated list:

```
# This command will not work with the example data since lane 2 and tile 1102 are missing.
> hilive -b BaseCalls -i index/hiv1 --lanes 1,2 --tiles 1101,1102 -r 100R
```

Since the number of tiles can be very large for real sequencing applications (e.g., 96 tiles for Illumina HiSeq in High Output mode), HiLive2 also comes with the `--max-tile` option.
By typing the maximum tile number that occurs for your sequencing run, all other numbers will be computed automatically according to Illumina nomenclature.  
The following command will include the tiles 1101-1116, 1201-1216, 2101-2116 and 2201-2216:

```
# This command will not work with the example data since several tiles are missing.
> hilive -b BaseCalls -i index/hiv1 --lanes 1 --max-tile 2216 -r 100R
```

Alternatively, the number of lanes and tiles can be determined automatically from Illumina's `runInfo.xml` file by using the `--runinfo` option of HiLive2 (not shown in this tutorial).

##### Output and temp directories
HiLive2 writes output and temp files in two different directories.  
By default, this is `./out` and `./temp` for the output and temporary files, respectively. The directories are created in your current working location.  
Use the `-o [--out-dir]` and `--temp-dir` parameters to change the output and temporary directory, respectively. Not existing directories are automatically created.  
  
NOTE: Existing files from previous runs will be overwritten. We strongly recommend to use separated output and temporary directories for each sequencing run!

##### Output format
You may change the output format from `BAM` (default) to `SAM` by using `--out-format SAM` or `-f SAM`.

##### Multithreading
Please specify the maximal number of threads with the `-n [--num-threads]` option.  
The default number of threads is 1. However, we recommend to use one thread for each analyzed tile.


### Paired-end sequencing

To align both reads rather than only the first 100bp, the segment order must be specified with the `-r [--reads]` option.  
In Illumina sequencing, barcodes of paired-end sequencing mostly occur in the middle of both reads. For the given data, we simulated 2x100bp reads and 2x4bp barcodes in between them.
This corresponds to the segment structure 100R-4B-4B-100R. In HiLive2, specify the segments order as a comma-separated list:

```
> hilive -b BaseCalls -i index/hiv1 --lanes 1 --tiles 1101 -r 100R,4B,4B,100R 
```

Alignments for both reads are written to the same BAM file as defined in the SAM/BAM specification.  
However, HiLive2 only reports independent alignments for both reads rather than proper pairs.


### Live demultiplexing

With live demultiplexing, it is possible to distinguish different samples of the same sequencing run.  
HiLive2 supports arbitrary combinations of the segment structure which also allows to use dual barcodes as shown in this example (2x4bp).  
Before you continue, please remove all previously created data:

```
> rm -r ./out
> rm -r ./temp
```

Our example data set contains reads with two slightly different Barcodes:
```
ACAG-TCGA
|| |-| ||
ACGG-TGGA
```

With default parameters, two errors per barcode fragment are tolerated. Thus, all reads fulfill the criteria and will be aligned and reported.  
The desired barcode sequence(s) are declared with the `-B [--barcodes]` option.  
The number of permitted barcodes per fragment can be set with `--barcode-errors`.  
While both parameters are comma-separated lists in general, the tolerated number of errors can also be set with a single number to set this value for all fragments. 
Dual barcodes are separated with a "-" character. 
To also align and report undetermined barcodes, set the `--align-undetermined-barcodes` option.  
There are 75 reads with the first and 25 reads with the second barcode. 
For each read, there are two reported alignments (one per segment).  
The following examples show how the number of reads in the respective output files behaves for the given data: 

```
# Default barcode options
> hilive -b BaseCalls -i index/hiv1 --lanes 1 --tiles 1101 -r 100R,4B,4B,100R -B ACAG-TCGA --align-undetermined-barcodes
[...]
> samtools view out/hilive_out_cycle208_ACAG-TCGA.bam | wc -l
200
> samtools view out/hilive_out_cycle208_undetermined.bam | wc -l
0

# Strict barcode options (not tolerating errors)
> hilive -b BaseCalls -i index/hiv1 --lanes 1 --tiles 1101 -r 100R,4B,4B,100R -B ACAG-TCGA --barcode-errors 0 --align-undetermined-barcodes
[...]
> samtools view out/hilive_out_cycle208_ACAG-TCGA.bam | wc -l
150
> samtools view out/hilive_out_cycle208_undetermined.bam | wc -l
50

# Tolerate 0 errors in the first barcode segment and 1 error in the second, don't align reads with undetermined barcode
> hilive -b BaseCalls -i index/hiv1 --lanes 1 --tiles 1101 -r 100R,4B,4B,100R -B ACAG-TCGA --barcode-errors 0,1
[...]
> samtools view out/hilive_out_cycle208_ACAG-TCGA.bam | wc -l
150
> samtools view out/hilive_out_cycle208_undetermined.bam | wc -l
[E::hts_open_format] fail to open file 'out/hilive_out_cycle208_undetermined.bam'
samtools view: failed to open "out/hilive_out_cycle208_undetermined.bam" for reading: No such file or directory
0

# Specify both barcodes, don't align reads with undetermined barcode. 
# In this example, all reads match the first barcode with the default number of tolerated error, thus all reads are in the output file of ACAG-TCGA
> hilive -b BaseCalls -i index/hiv1 --lanes 1 --tiles 1101 -r 100R,4B,4B,100R -B ACAG-TCGA,ACGG-TGGA
[...]
> samtools view out/hilive_out_cycle208_ACAG-TCGA.bam | wc -l
200
> samtools view out/hilive_out_cycle208_ACGG-TGGA.bam | wc -l
0

# If the tolerated number of error is adapted, the barcodes are assigned as expected:
> hilive -b BaseCalls -i index/hiv1 --lanes 1 --tiles 1101 -r 100R,4B,4B,100R -B ACAG-TCGA,ACGG-TGGA --barcode-errors 0
[...]
> samtools view out/hilive_out_cycle208_ACAG-TCGA.bam | wc -l
150
> samtools view out/hilive_out_cycle208_ACGG-TGGA.bam | wc -l
50

```


### Output cycles
When performing real-time read mapping, it is often valuable to get intermediate mapping results when the sequencing machine is still running rather than only producing results after the last sequencing cycle.  
Therefore, it is possible to write SAM/BAM output files for intermediate sequencing cycles using the `-O [--out-cycles]` option.  
In the following example, output files are written in the cycles 50, 100, 158 and 208:

```
> hilive -b BaseCalls -i index/hiv1 --lanes 1 --tiles 1101 -r 100R,4B,4B,100R -B ACAG-TCGA -O 50,100,158,208
[...]
> ls -lah ./out
hilive_out_cycle50_ACAG-TCGA.bam
hilive_out_cycle100_ACAG-TCGA.bam
hilive_out_cycle158_ACAG-TCGA.bam
hilive_out_cycle208_ACAG-TCGA.bam
```

### Continue an interrupted run
If a HiLive2 run is interrupted for some reason, it can be continued from a specified cycle. This cycle must be finished for all lanes and tiles. You have to check manually for this.  
To continue a run, use the `--continue` parameter and specify the cycle after the last available temporary alignment files for all cycles. You should also use the automatically created configuration file for the original HiLive2 run as an input.  
To try this, run HiLive2 on the example data with the following settings:

```
> hilive -b BaseCalls -i index/hiv1 --lanes 1 --tiles 1101 -r 100R,4B,4B,100R -B ACAG-TCGA -O 50,100,158,208 --keep-files 25,75,108,133,183
```

With this command, you can continue the run in cycle 26, 51, 101, 109, 134, 159 or 184 (only these temporary alignment files are available). You might interrupt your run after cycle 75.  
Then, continue your run with the following options:

```
> hilive --continue 76 --config ./temp/hilive_config.ini
```


Using a configuration file
--------------------------

HiLive2 supports config files in `.ini` format. This format is pretty straightforward in general: Just use a single line of the following format to describe one option-value-pair:

```
option=value
```

Please note that you have to use the full-length option declaration, so for specifying the number of threads, use `num-threads` rather than `n` as a key:

```
num-threads=32
```

For multitoken arguments, use the comma-separated format as described for the command line input:

```
reads=100R,4B,4B,100R
barcodes=ACAG-TCGA,ACGG-TGGA
```

To set a bool switch (or flag), set the value for the option to `true`:

```
extended-cigar=true
```

Please note that bool switches (or flags) set in the configuration file to the non-default value cannot be deactivated via the command line.  
  
HiLive2 follows a priorization principle for command line input in the following order (from high to low priority):

 * Command line input
 * Config file input
 * runInfo parsing
 * Default value

This means, when setting options in both the configuration file and the command line, the value provided via the command line will be used.  
It was implemented like this to support a template system where configuration files can be used to set default settings and further command specification can be done via the command line.  
For example, there could be a configuration file template `HiSeq2500_rapid.ini` to specify the correct number of lanes and tiles for a HiSeq2500 run.  
  
The `config.ini` file provided with the example data looks like this:
```
bcl-dir=./BaseCalls
lanes=1
tiles=1101
reads=100R,4B,4B,100R
barcodes=ACAG-TCGA
out-dir=./out
out-format=BAM
out-cycles=50,100,158,208
out-mode=ANYBEST
index=index/hiv1
align-undetermined-barcodes=true
temp-dir=./temp
```

To execute HiLive2 with the settings given in the example `config-ini`, just type the following command:

```
> hilive --config config.ini
```

Please note that, in the given example, it is not possible to deactivate the `align-undetermined-barcodes` parameter via the command line.  
To add or replace other options obtained from the configuration file, just type the respective options on the command line:  

```
> hilive --config config.ini --out-dir ./out_new --extended-cigar
```

The given example will change the output directory and activate extended CIGAR format in the output files.  


Using the `hilive-out` executable
---------------------------------

Before starting this part of the tutorial, please remove all previously created directories:

```
> rm -r ./out* && rm -r ./temp
```

The `hilive-out` executable can be used for two general use cases:

 * Write output for cycles that were not written when running HiLive2
 * Write output with different output parameters

In general, all parameters that can be set for the `hilive` executable can also be set for `hilive-out`.  
However, the only required option for `hilive-out` is `--config` which may not directly be intuitive.  
During each run of HiLive2, a configuration file containing the run settings is created automatically in the specified temporary directory. 
This is the configuration file that should be used to run `hilive-out`. In doing so, all run settings are automatically also applited to `hilive-out`.  
It should be noted that paths are stored as specified by the user, so pay attention when using relative paths!  
  
To prepare the execution of `hilive-out` on the example data, first run HiLive2 with the given configuration file:

```
> hilive --config config.ini
```

In the configuration file, the output cycle 50,100,158 and 208 are specified.  
Now try to create an output of cycle 75 by using `hilive-out` and specifying the output cycle.  
Please pay attention that the configuration file in `temp/hilive_config.ini` is used rather than the configuration file used before.  
However, you will recognize that the execution is not successful:

```
> hilive-out --config ./temp/hilive_config.ini -O 75
[...]
Writing of task Lane 1 Tile 1101 Cycle b.75 failed:  File  does not exist.
Finished output of cycle 75 (0 finished, 1 failed).
Finished.
```

This is because output can only be written for a cycle if the temporary alignment files were stored. This is done only for the specified output cycles by default.    
Keeping files can be specified manually when using `--keep-all-files`, which stores the files of all cycles, or `--keep-files` to keep the files of specified cycles.  
Please pay attention that in a real sequencing scenario the `--keep-all-files` option might lead to huge disk space requirements.  
  
However, with the given files it is possible to create different output than in the original run.  
For example, it is possible to change the output mode, maximal softclip ratio and extended CIGAR format for the final output. We also recommend to always change the output directory:

```
> hilive-out --config ./temp/hilive_config.ini -O 208 --out-mode ALL --max-softclip-ratio 0.1 --extended-cigar --out-dir ./out_new
```


Additional remarks
------------------

For errors in the tutorial or in the description, please create an issue on https://gitlab.com/lokat/hilive2/issues