/*Package fai implements FASTA sequence index (.fai) handling, including creating,
reading, and random access.
The code for the fai data structure was copied and edited from [1].
The code for creating and reading fai files, and the test code, is my own.
The code for random access to subsequences was copied from [2] and extended considerably.
References:
[1]. https://github.com/biogo/biogo/blob/master/io/seqio/fai/fai.go
[2]. https://github.com/brentp/faidx/blob/master/faidx.go
## General Usage
	import "github.com/shenwei356/bio/seqio/fai"
	file := "seq.fa"
	faidx, err := fai.New(file)
	checkErr(err)
	defer func() {
		checkErr(faidx.Close())
	}()
	// whole sequence
	seq, err := faidx.Seq("cel-mir-2")
	checkErr(err)
	// single base
	s, err := faidx.Base("cel-let-7", 1)
	checkErr(err)
	// subsequence. start and end are both 1-based
	seq, err = faidx.SubSeq("cel-mir-2", 15, 19)
	checkErr(err)
## Extended SubSeq
The extended SubSeq also accepts negative positions, which count back from
the end of the sequence (-1 is the last base).
This is a custom locating strategy. Start and end are both 1-based and inclusive.
The examples below illustrate the strategy:
	 1-based index    1 2 3 4 5 6 7 8 9 10
	negative index    0-9-8-7-6-5-4-3-2-1
	           seq    A C G T N a c g t n
	           1:1    A
	           2:4      C G T
	         -4:-2                c g t
	         -4:-1                c g t n
	         -1:-1                      n
	          2:-2      C G T N a c g t
	          1:-1    A C G T N a c g t n
	          1:12    A C G T N a c g t n
	        -12:-1    A C G T N a c g t n
Example:
	// last 12 bases
	seq, err := faidx.SubSeq("cel-mir-2", -12, -1)
	checkErr(err)
## Advanced Usage
Function `fai.New(file string)` is a wrapper that simplifies the process of
creating and reading a FASTA index. Here is what happens inside:
	func New(file string) (*Faidx, error) {
		fileFai := file + ".fai"
		var index Index
		if _, err := os.Stat(fileFai); os.IsNotExist(err) {
			index, err = Create(file)
			if err != nil {
				return nil, err
			}
		} else {
			index, err = Read(fileFai)
			if err != nil {
				return nil, err
			}
		}
		return NewWithIndex(file, index)
	}
By default, the sequence ID is used as the key in the FASTA index file.
Inside the package, a regular expression extracts the sequence ID from the
full head. The default value is `^([^\s]+)\s?`, which captures the leading
run of non-space characters of the head.
So in the common case you can just use `fai.Create(file string)` to create the .fai file.
If you want to use the full head instead of the sequence ID,
use `fai.CreateWithIDRegexp(file string, idRegexp string)` to create the index,
with `idRegexp` set to `^(.+)$`. For convenience, the function
`CreateWithFullHead` does the same.
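To see the difference between the two expressions, the sketch below applies
both to the same head using only the standard `regexp` package. The helper
`indexKey`, its fallback behaviour, and the sample head are made up for
illustration; the package's own extraction code may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// The two patterns quoted in the text above.
	defaultIDRegexp = regexp.MustCompile(`^([^\s]+)\s?`)
	fullHeadRegexp  = regexp.MustCompile(`^(.+)$`)
)

// indexKey is a hypothetical helper, not part of package fai.
// It returns the first capture group of re applied to head.
// Falling back to the whole head when nothing matches is a choice
// made for this sketch only.
func indexKey(head string, re *regexp.Regexp) string {
	m := re.FindStringSubmatch(head)
	if len(m) < 2 {
		return head
	}
	return m[1]
}

func main() {
	head := "cel-mir-2 some description" // illustrative head, without the leading '>'
	fmt.Println(indexKey(head, defaultIDRegexp))
	fmt.Println(indexKey(head, fullHeadRegexp))
}
```

With the default expression the key is the ID before the first space; with
`^(.+)$` the key is the whole head, so lookups must pass the whole head too.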
## More Advanced Usage
Note that ***by default, the whole file is mapped into shared memory***,
which is fine for files smaller than your RAM.
For very large files, you should disable this;
file seeking is then used instead.
	// change the global variable
	fai.MapWholeFile = false
	// then do other things
*/
package fai