## Reading genomic dataframes

In [1]:
import bioframe

Bioframe provides multiple methods in `bioframe.io` to convert data stored in common genomic file formats to pandas DataFrames.



### Reading tabular text data

The most common need is to read tabular data, which can be accomplished with `bioframe.read_table`. This function wraps `pandas.read_csv`/`pandas.read_table` (tab-delimited by default), but lets the user pass a **schema** (i.e. a list of pre-defined column names) for common genomic interval-based file formats.

For example:

In [12]:
df = bioframe.read_table(
    'https://www.encodeproject.org/files/ENCFF001XKR/@@download/ENCFF001XKR.bed.gz', 
    schema='bed9'
)
display(df[0:3])

Unnamed: 0,chrom,start,end,name,score,strand,thickStart,thickEnd,rgb
0,chr1,193500,194500,.,400,+,.,.,179450
1,chr1,618500,619500,.,700,+,.,.,179450
2,chr1,974500,975500,.,1000,+,.,.,179450


In [11]:
df = bioframe.read_table(
    'https://www.encodeproject.org/files/ENCFF401MQL/@@download/ENCFF401MQL.bed.gz',
    schema='narrowPeak'
)
display(df[0:3])

Unnamed: 0,chrom,start,end,name,score,strand,fc,-log10p,-log10q,relSummit
0,chr19,48309541,48309911,.,1000,.,5.04924,-1.0,0.00438,185
1,chr4,130563716,130564086,.,993,.,5.05052,-1.0,0.00432,185
2,chr1,200622507,200622877,.,591,.,5.05489,-1.0,0.004,185


In [14]:
df = bioframe.read_table(
    'https://www.encodeproject.org/files/ENCFF001VRS/@@download/ENCFF001VRS.bed.gz', 
     schema='bed12'
)
display(df[0:3])

The `schema` argument looks up the column names for a file type in a registry of schemas stored in the `bioframe.SCHEMAS` dictionary:

In [62]:
bioframe.SCHEMAS['bed6']

['chrom', 'start', 'end', 'name', 'score', 'strand']
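To make the role of a schema concrete, the following sketch reproduces the general idea with plain pandas on a small in-memory BED6 snippet. The `BED6_COLUMNS` list is copied from the output above; the sample records and the direct `pandas.read_csv` call are illustrative assumptions, not bioframe's actual implementation.

```python
import io

import pandas as pd

# Column names copied from the bioframe.SCHEMAS['bed6'] output above.
BED6_COLUMNS = ["chrom", "start", "end", "name", "score", "strand"]

# Hypothetical headerless, tab-delimited BED6 records (made up for illustration).
bed_text = (
    "chr1\t100\t200\tfeatA\t10\t+\n"
    "chr1\t300\t450\tfeatB\t20\t-\n"
    "chr2\t50\t75\tfeatC\t30\t+\n"
)

# A schema supplies the column names that a headerless file lacks.
bed_df = pd.read_csv(
    io.StringIO(bed_text),
    sep="\t",
    header=None,
    names=BED6_COLUMNS,
)
print(bed_df)
```

Because the names come from a fixed list, the same file read with a longer or shorter schema would be labeled differently, so the schema should match the number of columns actually present.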

### UCSC Big Binary Indexed files (BigWig, BigBed)

Bioframe also has convenience functions for reading and writing bigWig and bigBed binary files to and from pandas DataFrames.

In [70]:
bw_url = 'http://genome.ucsc.edu/goldenPath/help/examples/bigWigExample.bw'
df = bioframe.read_bigwig(bw_url, "chr21", start=10_000_000, end=10_010_000)
df.head(5)

Unnamed: 0,chrom,start,end,value
0,chr21,10000000,10000005,40.0
1,chr21,10000005,10000010,40.0
2,chr21,10000010,10000015,60.0
3,chr21,10000015,10000020,80.0
4,chr21,10000020,10000025,40.0
...,...,...,...,...
1995,chr21,10009975,10009980,40.0
1996,chr21,10009980,10009985,60.0
1997,chr21,10009985,10009990,60.0
1998,chr21,10009990,10009995,20.0


In [83]:
df['value'] *= 100
df.head(5)

Unnamed: 0,chrom,start,end,value
0,chr21,10000000,10000005,4000.0
1,chr21,10000005,10000010,4000.0
2,chr21,10000010,10000015,6000.0
3,chr21,10000015,10000020,8000.0
4,chr21,10000020,10000025,4000.0


In [82]:
chromsizes = bioframe.fetch_chromsizes('hg19')
bioframe.to_bigwig(df, chromsizes, 'times100.bw')  
# note: requires UCSC bedGraphToBigWig binary, which can be installed as
# !conda install -y -c bioconda ucsc-bedgraphtobigwig

CompletedProcess(args=['bedGraphToBigWig', '/var/folders/4s/d866wm3s4zbc9m41334fxfwr0000gp/T/tmpdvz4xpzu.bg', '/var/folders/4s/d866wm3s4zbc9m41334fxfwr0000gp/T/tmp00_9n7bj.chrom.sizes', 'times100.bw'], returncode=0, stdout=b'', stderr=b'')
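The `CompletedProcess` output above suggests that `to_bigwig` writes a temporary bedGraph file and a temporary chromosome sizes file before invoking `bedGraphToBigWig`. As a rough sketch of that intermediate step only (an assumption about the general approach, not bioframe's code), the same two text files can be produced with plain pandas. The signal values, the chromosome length, and the file names below are illustrative.

```python
import os
import tempfile

import pandas as pd

# Hypothetical signal track with the same columns as the read_bigwig output.
signal = pd.DataFrame(
    {
        "chrom": ["chr21", "chr21", "chr21"],
        "start": [10_000_000, 10_000_005, 10_000_010],
        "end": [10_000_005, 10_000_010, 10_000_015],
        "value": [40.0, 40.0, 60.0],
    }
)
# Chromosome sizes as a Series indexed by name (illustrative length).
sizes = pd.Series({"chr21": 48_129_895}, name="length")

workdir = tempfile.mkdtemp()
bg_path = os.path.join(workdir, "signal.bg")
cs_path = os.path.join(workdir, "genome.chrom.sizes")

# bedGraph: four tab-separated columns, no header, no index.
signal[["chrom", "start", "end", "value"]].to_csv(
    bg_path, sep="\t", header=False, index=False
)
# chrom.sizes: two tab-separated columns (name, length), no header.
sizes.to_csv(cs_path, sep="\t", header=False)

# Read the files back to inspect what was written.
with open(bg_path) as f:
    bg_lines = f.read().splitlines()
with open(cs_path) as f:
    cs_lines = f.read().splitlines()
print(bg_lines[0])
print(cs_lines[0])
```

The two files could then be passed to `bedGraphToBigWig` in the same argument order shown in the `CompletedProcess` output: bedGraph file, chromosome sizes file, output path.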

In [84]:
bb_url = 'http://genome.ucsc.edu/goldenPath/help/examples/bigBedExample.bb'
bioframe.read_bigbed(bb_url, "chr21", start=48000000).head()

Unnamed: 0,chrom,start,end
0,chr21,48003453,48003785
1,chr21,48003545,48003672
2,chr21,48018114,48019432
3,chr21,48018244,48018550
4,chr21,48018843,48019099


### Reading genome assembly information

The most fundamental information about a genome assembly is its set of chromosome sizes.

Bioframe provides functions to read a chromosome sizes file as a `pandas.Series`, with some useful filtering and sorting options:

In [40]:
bioframe.read_chromsizes(
    'https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes'
)

chr1     248956422
chr2     242193529
chr3     198295559
chr4     190214555
chr5     181538259
chr6     170805979
chr7     159345973
chr8     145138636
chr9     138394717
chr10    133797422
chr11    135086622
chr12    133275309
chr13    114364328
chr14    107043718
chr15    101991189
chr16     90338345
chr17     83257441
chr18     80373285
chr19     58617616
chr20     64444167
chr21     46709983
chr22     50818468
chrX     156040895
chrY      57227415
chrM         16569
Name: length, dtype: int64

In [41]:
bioframe.read_chromsizes('https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes',
                         filter_chroms=False)

chr1                248956422
chr2                242193529
chr3                198295559
chr4                190214555
chr5                181538259
                      ...    
chrUn_KI270539v1          993
chrUn_KI270385v1          990
chrUn_KI270423v1          981
chrUn_KI270392v1          971
chrUn_KI270394v1          970
Name: length, Length: 455, dtype: int64

In [42]:
dm6_url = 'https://hgdownload.soe.ucsc.edu/goldenPath/dm6/database/chromInfo.txt.gz'

In [43]:
bioframe.read_chromsizes(dm6_url,
                         filter_chroms=True, 
                         chrom_patterns=("^chr2L$", "^chr2R$", "^chr3L$", "^chr3R$", "^chr4$", "^chrX$")
)

chr2L    23513712
chr2R    25286936
chr3L    28110227
chr3R    32079331
chr4      1348131
chrX     23542271
Name: length, dtype: int64

In [44]:
# Raw strings keep "\d" from being treated as an invalid string escape.
bioframe.read_chromsizes(
    dm6_url,
    chrom_patterns=[r"^chr\d+L$", r"^chr\d+R$", r"^chr4$", r"^chrX$", r"^chrM$"]
)

chr2L    23513712
chr3L    28110227
chr2R    25286936
chr3R    32079331
chr4      1348131
chrX     23542271
chrM        19524
Name: length, dtype: int64
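The two dm6 outputs above list the chromosome arms in different orders, which is consistent with the result following the order of the patterns. One way to picture this (a simplified sketch with the standard `re` module, not bioframe's implementation) is to collect the matching names pattern by pattern. The helper `select_by_patterns` is hypothetical, any sorting within a pattern is omitted, and `chrUn_example` is a made-up scaffold; the other lengths are copied from the outputs above.

```python
import re

# Lengths copied from the outputs above, listed in an arbitrary order,
# plus one made-up scaffold ("chrUn_example") that no pattern should match.
chrom_lengths = {
    "chr2L": 23513712,
    "chr2R": 25286936,
    "chr3L": 28110227,
    "chr3R": 32079331,
    "chr4": 1348131,
    "chrM": 19524,
    "chrUn_example": 1000,
    "chrX": 23542271,
}


def select_by_patterns(lengths, patterns):
    """Keep names matching any pattern, grouped in pattern order."""
    selected = {}
    for pattern in patterns:
        regex = re.compile(pattern)
        for name, length in lengths.items():
            # A name is assigned to the first pattern it matches.
            if regex.search(name) and name not in selected:
                selected[name] = length
    return selected


patterns = [r"^chr\d+L$", r"^chr\d+R$", r"^chr4$", r"^chrX$", r"^chrM$"]
selected = select_by_patterns(chrom_lengths, patterns)
print(list(selected))
```

Anchoring each pattern with `^` and `$` matters here: without the anchors, a pattern such as `chr4` would also match any scaffold name that merely contains that text.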

Bioframe provides a convenience function to fetch chromosome sizes from UCSC given an assembly name:

In [45]:
chromsizes = bioframe.fetch_chromsizes('hg38')
chromsizes[-5:]

chr21     46709983
chr22     50818468
chrX     156040895
chrY      57227415
chrM         16569
Name: length, dtype: int64

In [24]:
# bioframe also has locally stored information for certain assemblies that can be
# read as follows:
# bioframe.get_seqinfo()
# bioframe.get_chromsizes('hg38', unit='primary', type=('chromosome', 'non-nuclear'))

Bioframe can also generate a list of centromere positions using information from some UCSC assemblies:

In [51]:
display(
    bioframe.fetch_centromeres('hg38')[:3]
)

Unnamed: 0,chrom,start,end,mid
0,chr1,122026459,124932724,123479591
1,chr10,39686682,41593521,40640101
2,chr11,51078348,54425074,52751711


These functions are thin wrappers around a UCSC client. Users can also use `UCSCClient` directly:

In [54]:
client = bioframe.UCSCClient('hg38')
client.fetch_cytoband()

Unnamed: 0,chrom,start,end,name,gieStain
0,chr1,0,2300000,p36.33,gneg
1,chr1,2300000,5300000,p36.32,gpos25
2,chr1,5300000,7100000,p36.31,gneg
3,chr1,7100000,9100000,p36.23,gpos25
4,chr1,9100000,12500000,p36.22,gneg
...,...,...,...,...,...
1473,chr9_ML143353v1_fix,0,25408,,gneg
1474,chrX_ML143385v1_fix,0,17435,,gneg
1475,chrX_ML143384v1_fix,0,14678,,gneg
1476,chr22_ML143379v1_fix,0,12295,,gneg


### Reading other genomic formats

See the [docs for File I/O](https://bioframe.readthedocs.io/en/latest/api-fileops.html) for other supported file formats.