## Reading genomic dataframes

In [None]:
import bioframe

The `bioframe.io` module provides several functions for converting data stored in common genomic file formats into pandas DataFrames.



### Reading tabular text data

The most common need is to read tabular data, which can be done with `bioframe.read_table`. This function wraps `pandas.read_csv`/`pandas.read_table` (tab-delimited by default) and lets the user pass a **schema** (i.e., a list of predefined column names) for common genomic interval-based file formats.

For example:

In [None]:
df = bioframe.read_table(
    "https://www.encodeproject.org/files/ENCFF001XKR/@@download/ENCFF001XKR.bed.gz",
    schema="bed9",
)
display(df[0:3])

In [None]:
df = bioframe.read_table(
    "https://www.encodeproject.org/files/ENCFF401MQL/@@download/ENCFF401MQL.bed.gz",
    schema="narrowPeak",
)
display(df[0:3])

In [None]:
df = bioframe.read_table(
    "https://www.encodeproject.org/files/ENCFF001VRS/@@download/ENCFF001VRS.bed.gz",
    schema="bed12",
)
display(df[0:3])

The `schema` argument is looked up by name in a registry of schemas stored in the `bioframe.SCHEMAS` dictionary:

In [None]:
bioframe.SCHEMAS["bed6"]
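Conceptually, a schema is just a list of column names handed to the pandas parser. The sketch below mimics that with plain pandas on an in-memory BED6 snippet. The column names are the standard BED6 fields (assumed here to match the `"bed6"` schema) and the two records are made up for illustration, so treat it as an approximation of what `bioframe.read_table` does rather than its actual implementation.

```python
import io

import pandas as pd

# Standard BED6 field names (assumed to match what the "bed6" schema provides).
BED6_COLUMNS = ["chrom", "start", "end", "name", "score", "strand"]

# Made-up, tab-delimited BED6 records used only for illustration.
bed_text = "chr1\t100\t200\tpeak1\t0\t+\nchr2\t150\t300\tpeak2\t5\t-\n"

# Parse header-less, tab-delimited text with explicit column names.
df_bed6 = pd.read_csv(
    io.StringIO(bed_text), sep="\t", header=None, names=BED6_COLUMNS
)
print(df_bed6)
```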

### UCSC Big Binary Indexed files (bigWig, bigBed)

Bioframe also has convenience functions for reading and writing bigWig and bigBed binary files to and from pandas DataFrames.

In [None]:
bw_url = "http://genome.ucsc.edu/goldenPath/help/examples/bigWigExample.bw"
df = bioframe.read_bigwig(bw_url, "chr21", start=10_000_000, end=10_010_000)
df.head(5)

In [None]:
df["value"] *= 100
df.head(5)

In [None]:
chromsizes = bioframe.fetch_chromsizes("hg19")
# bioframe.to_bigwig(df, chromsizes, 'times100.bw')

# note: requires UCSC bedGraphToBigWig binary, which can be installed as
# !conda install -y -c bioconda ucsc-bedgraphtobigwig
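Because writing bigWig files depends on an external executable, it can help to check for it before attempting the export. The following is a minimal pre-flight sketch using only the standard library; it assumes the executable is named `bedGraphToBigWig` and would be discoverable on `PATH` once installed.

```python
import shutil

# Name of the UCSC executable mentioned above; assumed to be on PATH when installed.
TOOL_NAME = "bedGraphToBigWig"

# shutil.which returns the full path to the executable, or None if it is not found.
tool_path = shutil.which(TOOL_NAME)
if tool_path is None:
    print(f"{TOOL_NAME} not found; install it before calling bioframe.to_bigwig")
else:
    print(f"{TOOL_NAME} found at {tool_path}")
```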

In [None]:
bb_url = "http://genome.ucsc.edu/goldenPath/help/examples/bigBedExample.bb"
bioframe.read_bigbed(bb_url, "chr21", start=48000000).head()

### Reading genome assembly information

The most fundamental information about a genome assembly is its set of chromosome sizes.

Bioframe provides functions to read a chromosome sizes file into a `pandas.Series`, with some useful filtering and sorting options:

In [None]:
bioframe.read_chromsizes(
    "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes"
)

In [None]:
bioframe.read_chromsizes(
    "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes",
    filter_chroms=False,
)

In [None]:
dm6_url = "https://hgdownload.soe.ucsc.edu/goldenPath/dm6/database/chromInfo.txt.gz"

In [None]:
bioframe.read_chromsizes(
    dm6_url,
    filter_chroms=True,
    chrom_patterns=("^chr2L$", "^chr2R$", "^chr3L$", "^chr3R$", "^chr4$", "^chrX$"),
)

In [None]:
bioframe.read_chromsizes(
    dm6_url, chrom_patterns=[r"^chr\d+L$", r"^chr\d+R$", "^chr4$", "^chrX$", "^chrM$"]
)
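To make the role of `chrom_patterns` more concrete, here is a rough sketch of pattern-based filtering and ordering written with the standard library only. It assumes that names are grouped in the order of the patterns and sorted naturally within each group; `natural_key` and `filter_and_order` are hypothetical helpers, the input names are a made-up sample, and bioframe's own implementation may differ in its details.

```python
import re


def natural_key(name):
    # Split into digit and non-digit runs so that "chr2L" sorts before "chr10L".
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


def filter_and_order(names, patterns):
    """Keep names matching any pattern, grouped in pattern order (hypothetical helper)."""
    ordered = []
    for pattern in patterns:
        regex = re.compile(pattern)
        # Skip names already claimed by an earlier pattern.
        matches = [n for n in names if regex.search(n) and n not in ordered]
        ordered.extend(sorted(matches, key=natural_key))
    return ordered


# Made-up sample of sequence names, deliberately out of order.
names = ["chrX", "chr3R", "chr2L", "chrM", "chr4", "chr3L", "chr2R", "chrUn_DS483562v1"]
patterns = [r"^chr\d+L$", r"^chr\d+R$", "^chr4$", "^chrX$", "^chrM$"]

selected = filter_and_order(names, patterns)
print(selected)
```

Names that match none of the patterns (here the `chrUn_...` entry) are dropped, which is the filtering half of the behavior.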

Bioframe provides a convenience function to fetch chromosome sizes from UCSC given an assembly name:

In [None]:
chromsizes = bioframe.fetch_chromsizes("hg38")
chromsizes[-5:]

Bioframe can also derive centromere positions from information available for some UCSC assemblies:

In [None]:
display(bioframe.fetch_centromeres("hg38")[:3])

These functions are just wrappers for a UCSC client. Users can also use `UCSCClient` directly:

In [None]:
client = bioframe.UCSCClient("hg38")
client.fetch_cytoband()

### Curated genome assembly build information

_New in v0.5.0_

Bioframe also has locally stored information for common genome assembly builds. 

For a given provider and assembly build, this API provides additional sequence metadata:

* A canonical **name** for every sequence, usually opting for UCSC-style naming.
* A canonical **ordering** of the sequences.
* Each sequence's **length**.
* An **alias dictionary** mapping alternative names or aliases to the canonical sequence name.
* Each sequence is assigned to an assembly **unit**: e.g., primary, non-nuclear, decoy.
* Each sequence is assigned a **role**: e.g., assembled molecule, unlocalized, unplaced.

In [None]:
bioframe.assemblies_available()

In [None]:
hg38 = bioframe.assembly_info("hg38")
print(hg38.provider, hg38.provider_build)
hg38.seqinfo

In [None]:
hg38.chromsizes

In [None]:
hg38.alias_dict["MT"]

In [None]:
bioframe.assembly_info("hg38", roles="all").seqinfo
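The effect of selecting sequences by role and unit can be illustrated with plain pandas. The sketch below builds a toy table from a few mm9 rows and filters it. `select_sequences` is a hypothetical helper, and its default arguments are an assumption chosen to resemble a typical "assembled molecules only" view; it is not bioframe's internal code.

```python
import pandas as pd

# Toy seqinfo table with a few rows from the UCSC mm9 assembly.
seqinfo = pd.DataFrame(
    {
        "name": ["chr1", "chrM", "chr1_random", "chrUn_random"],
        "length": [197195432, 16299, 1231697, 5900358],
        "role": ["assembled", "assembled", "unlocalized", "unplaced"],
        "unit": ["primary", "non-nuclear", "primary", "primary"],
    }
)


def select_sequences(table, roles=("assembled",), units=("primary", "non-nuclear")):
    """Hypothetical helper: keep rows whose role and unit are in the given sets."""
    mask = table["role"].isin(roles) & table["unit"].isin(units)
    return table[mask].reset_index(drop=True)


# Default view: assembled molecules only.
default_view = select_sequences(seqinfo)
print(default_view["name"].tolist())

# Widened view: include unlocalized and unplaced sequences as well.
all_roles = select_sequences(seqinfo, roles=("assembled", "unlocalized", "unplaced"))
print(all_roles["name"].tolist())
```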

### Contributing metadata for a new assembly build

To contribute a new assembly build to bioframe's internal metadata registry, make a pull request with the following items:

1. Add a record to the assembly manifest file located at `bioframe/io/data/_assemblies.yml`. Required fields are as shown in the example below.
2. Create a `seqinfo.tsv` file for the new assembly build and place it in `bioframe/io/data`. Reference the exact file name in the manifest record's `seqinfo` field. The seqinfo file is tab-delimited, with a required header line as shown in the example below.
3. Optionally, add a `cytoband.tsv` file adapted from a `cytoBand.txt` file from UCSC.

Note that we currently do not include sequences with alt or patch roles in seqinfo files.

#### Example

Metadata for the mouse mm9 assembly build as provided by UCSC.

`_assemblies.yml`

> ```
> ...
> - organism: mus musculus
>   provider: ucsc
>   provider_build: mm9
>   release_year: 2007
>   seqinfo: mm9.seqinfo.tsv
>   default_roles: [assembled]
>   default_units: [primary, non-nuclear]
>   url: https://hgdownload.soe.ucsc.edu/goldenPath/mm9/bigZips/
> ...
> ```

`mm9.seqinfo.tsv`

> ```
> name	length	role	molecule	unit	aliases
> chr1	197195432	assembled	chr1	primary	1,CM000994.1,NC_000067.5
> chr2	181748087	assembled	chr2	primary	2,CM000995.1,NC_000068.6
> chr3	159599783	assembled	chr3	primary	3,CM000996.1,NC_000069.5
> chr4	155630120	assembled	chr4	primary	4,CM000997.1,NC_000070.5
> chr5	152537259	assembled	chr5	primary	5,CM000998.1,NC_000071.5
> chr6	149517037	assembled	chr6	primary	6,CM000999.1,NC_000072.5
> chr7	152524553	assembled	chr7	primary	7,CM001000.1,NC_000073.5
> chr8	131738871	assembled	chr8	primary	8,CM001001.1,NC_000074.5
> chr9	124076172	assembled	chr9	primary	9,CM001002.1,NC_000075.5
> chr10	129993255	assembled	chr10	primary	10,CM001003.1,NC_000076.5
> chr11	121843856	assembled	chr11	primary	11,CM001004.1,NC_000077.5
> chr12	121257530	assembled	chr12	primary	12,CM001005.1,NC_000078.5
> chr13	120284312	assembled	chr13	primary	13,CM001006.1,NC_000079.5
> chr14	125194864	assembled	chr14	primary	14,CM001007.1,NC_000080.5
> chr15	103494974	assembled	chr15	primary	15,CM001008.1,NC_000081.5
> chr16	98319150	assembled	chr16	primary	16,CM001009.1,NC_000082.5
> chr17	95272651	assembled	chr17	primary	17,CM001010.1,NC_000083.5
> chr18	90772031	assembled	chr18	primary	18,CM001011.1,NC_000084.5
> chr19	61342430	assembled	chr19	primary	19,CM001012.1,NC_000085.5
> chrX	166650296	assembled	chrX	primary	X,CM001013.1,NC_000086.6
> chrY	15902555	assembled	chrY	primary	Y,CM001014.1,NC_000087.6
> chrM	16299	assembled	chrM	non-nuclear	MT,AY172335.1,NC_005089.1
> chr1_random	1231697	unlocalized	chr1	primary	
> chr3_random	41899	unlocalized	chr3	primary	
> chr4_random	160594	unlocalized	chr4	primary	
> chr5_random	357350	unlocalized	chr5	primary	
> chr7_random	362490	unlocalized	chr7	primary	
> chr8_random	849593	unlocalized	chr8	primary	
> chr9_random	449403	unlocalized	chr9	primary	
> chr13_random	400311	unlocalized	chr13	primary	
> chr16_random	3994	unlocalized	chr16	primary	
> chr17_random	628739	unlocalized	chr17	primary	
> chrX_random	1785075	unlocalized	chrX	primary	
> chrY_random	58682461	unlocalized	chrY	primary	
> chrUn_random	5900358	unplaced		primary	
> ```
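As a rough illustration of how the `aliases` column relates to an alias dictionary, the sketch below parses a few of the rows above with the standard library and maps every alias to its canonical name. Mapping each canonical name to itself is an assumption made for convenience here, and `alias_to_name` is a hypothetical variable name, not part of bioframe.

```python
import csv
import io

# A few rows copied from the mm9 seqinfo example above, joined with tabs.
rows = [
    ["name", "length", "role", "molecule", "unit", "aliases"],
    ["chr1", "197195432", "assembled", "chr1", "primary", "1,CM000994.1,NC_000067.5"],
    ["chrM", "16299", "assembled", "chrM", "non-nuclear", "MT,AY172335.1,NC_005089.1"],
    ["chr1_random", "1231697", "unlocalized", "chr1", "primary", ""],
]
seqinfo_text = "\n".join("\t".join(row) for row in rows) + "\n"

alias_to_name = {}
for record in csv.DictReader(io.StringIO(seqinfo_text), delimiter="\t"):
    name = record["name"]
    alias_to_name[name] = name  # assumption: canonical names map to themselves
    # An empty aliases field yields [""], so drop empty strings.
    for alias in filter(None, record["aliases"].split(",")):
        alias_to_name[alias] = name

print(alias_to_name["MT"])
```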

### Reading other genomic formats

See the [docs for File I/O](https://bioframe.readthedocs.io/en/latest/api-fileops.html) for other supported file formats.