ChIP-Seq auxiliary Perl and C applications
============================================================================
This directory collects Perl scripts and C programs for format conversion and for other auxiliary tasks, such as read-count filtering and SGA file compression.
The main ChIP-seq programs use SGA (Simplified Genome Annotation), a simplified BED format in which records are sorted by sequence name and position.
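The sort order described above can be sketched as a qsort comparator. This is only an illustration: the record type sga_key_t and both function names are hypothetical, the struct holds just the two sort keys named in the text, and comparing sequence names byte by byte with strcmp is an assumption about how names are ordered.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record holding only the two sort keys named in the text. */
typedef struct {
    char seq_name[64];   /* sequence name, e.g. an NCBI identifier */
    long position;       /* position on that sequence */
} sga_key_t;

/* qsort comparator: order by sequence name first, then by position.
 * Assumption: names compare byte-wise (strcmp), as in the C locale. */
int sga_key_cmp(const void *a, const void *b)
{
    const sga_key_t *x = a;
    const sga_key_t *y = b;
    int c = strcmp(x->seq_name, y->seq_name);
    if (c != 0)
        return c;
    /* Branch-free three-way comparison; avoids overflow of x - y. */
    return (x->position > y->position) - (x->position < y->position);
}

/* Return 1 if the n records are already in this order, 0 otherwise. */
int sga_is_sorted(const sga_key_t *recs, size_t n)
{
    assert(recs != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        if (sga_key_cmp(&recs[i - 1], &recs[i]) > 0)
            return 0;
    return 1;
}
```

An array of such records could be put in order with qsort(recs, n, sizeof recs[0], sga_key_cmp) and checked afterwards with sga_is_sorted(recs, n).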
In a typical data analysis pipeline, the SGA file is generated from one of several richer formats, such as the Solexa genome mapping format, BAM, BED, or FPS (Functional Position Set). FPS is the format used by the Signal Search Analysis (SSA) programs at SIB.
We therefore provide simple, fast tools to convert SGA data files to and from other formats, in particular BED, WIG (Wiggle Track Format), and FPS.
WIG and BED files are used for viewing ChIP-seq data and results in the UCSC Genome Browser.
The binary file chro_idx.nstorage contains a Perl hash table that, for each supported assembly, stores pairs of chromosome numbers and NCBI identifiers, as well as chromosome lengths indexed by NCBI identifier.
This file is used by most conversion scripts. When required, its location (<path>) must be set with the --db <path> option.
The text file chr_NC_gi is used by the C format conversion programs (such as sga2bed and bed2sga) to build a hash table that links chromosome numbers to NCBI RefSeq identifiers and vice versa. When required, its location (<path>) must be set with the -i|--db <path> option.
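As a rough illustration of the two-way lookup described above, the sketch below maps chromosome names to NCBI identifiers and back. It makes several assumptions: the pairs are already loaded in memory (the on-disk layout of chr_NC_gi is not described here), a plain linear scan stands in for the hash table to keep the example short, and the type chr_pair_t and both function names are hypothetical rather than taken from the C programs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory pair; not the layout used by the real tools. */
typedef struct {
    const char *chr;   /* chromosome name, e.g. "chr1" */
    const char *acc;   /* NCBI RefSeq identifier */
} chr_pair_t;

/* Chromosome name -> NCBI identifier, or NULL if the name is unknown.
 * Linear scan used here in place of a hash table, for brevity only. */
const char *chr_to_acc(const chr_pair_t *tab, size_t n, const char *chr)
{
    assert(tab != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].chr, chr) == 0)
            return tab[i].acc;
    return NULL;
}

/* NCBI identifier -> chromosome name, or NULL if the identifier is unknown. */
const char *acc_to_chr(const chr_pair_t *tab, size_t n, const char *acc)
{
    assert(tab != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].acc, acc) == 0)
            return tab[i].chr;
    return NULL;
}
```

Both functions return NULL for an unknown key, so a caller can skip or report lines whose sequence name is not in the table instead of emitting a wrong identifier.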
The text file chr_size is used by a few C programs to fetch the size of a chromosome from its NCBI identifier. This makes it possible to check whether genome coordinates in the data files go beyond the chromosome boundaries.
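The boundary check described above might look like the following sketch. It assumes that the sizes are already loaded into an in-memory table (the layout of chr_size is not described here) and that positions are 1-based, so the valid range is 1 to the chromosome length inclusive. The type chr_size_t and the function pos_in_bounds are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory entry: NCBI identifier and chromosome length. */
typedef struct {
    const char *acc;   /* NCBI identifier of the chromosome */
    long size;         /* chromosome length in base pairs */
} chr_size_t;

/* Check a position against the chromosome boundaries.
 * Returns  1 if 1 <= pos <= size (assumes 1-based positions),
 *          0 if pos falls outside the chromosome,
 *         -1 if the identifier is not in the table. */
int pos_in_bounds(const chr_size_t *tab, size_t n, const char *acc, long pos)
{
    assert(tab != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(tab[i].acc, acc) == 0)
            return (pos >= 1 && pos <= tab[i].size) ? 1 : 0;
    }
    return -1;
}
```

Returning a separate code for an unknown identifier keeps "position out of range" distinct from "chromosome not in the size table", which usually call for different messages.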