1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103
|
# Directory containing some scripts to post-process and filter discoSnpRad results
## Filtering scripts
1. **script** `filter_by_cluster_size_and_rank.py`:
* removes variants belonging to a cluster (locus) whose size (nb of variants) is outside the given size range (options `-m` and `-M`)
* removes variants with rank lower than a given threshold given by option `-r`
* Usage :
`python filter_by_cluster_size_and_rank.py -i vcf_file [-o new_vcf_file -m 0 -M 150 -r 0.4]`
3. **script** `filter_vcf_by_indiv_cov_max_missing_and_maf.py`:
* replaces individual genotypes that have DP less than the value given by option `-c` by missing genotype `./.`
* removes variants (vcf lines) that have a fraction of missing genotypes greater than the value given by option `-m`
* removes variants (vcf lines) that have a minor allele frequency smaller than the value given by option `-f`
* outputs only SNP variants if option `-s`.
* Usage :
`python filter_vcf_by_indiv_cov_max_missing_and_maf.py -i vcf_file -o new_vcf_file [-c min_cov -m max_missing -f maf -s] `
3. **script** `filter_paralogs.py`:
* identifies variants (vcf lines) that have a fraction of heterozygous genotypes greater than `x` (not counting missing genotypes)
* removes variants (vcf lines) that belong to a cluster having a fraction of such variants greater than `y`
* Example : `x=0.1` and `y= 0.5` and if we consider a cluster to represent a locus. This filter removes loci that have more than 50% of the SNPs that have each more than 10% of heterozygous genotypes.
* Usage :
`python filter_paralogs.py -i vcf_file -o new_vcf_file [-x 0.1 -y 0.5]`
## Scripts for STRUCTURE analyses :
4. **script** `1SNP_per_cluster.py`
* selects one SNP per cluster (the one with less missing genotypes)
* Usage :
`python 1SNP_per_cluster.py -i vcf_file -o new_vcf_file`
5. **script** `vcf2structure.sh`
* changes the vcf format to a Structure format (input of the software Structure)
* Usage:
`sh vcf2structure.sh file.vcf`
(creates a file `file.str`)
## Mapping to a reference genome, and keeping the cluster information :
When a reference genome is available, even if variants have been called in a reference-free manner, it could be useful to get the positions of variants on this reference genome.
To get such information, two programs are necessary in the case of a discoSnpRAD result:
* `VCF_creator` (in dir `[DISCO_DIR]/scripts/`).
Note : it uses the mapper bwa, which must to be in the PATH env variable.
* **script** `add_cluster_info_to_mapped_vcf.py` in this current directory, to append the clustering information (and some minimal filtering on cluster size) in the vcf output by VCF_creator.
Here is the full pipeline to map the result of discoSnpRAD with name prefix `myDiscoSnpRADResult` to a reference genome `myReferenceGenome.fa`:
```
# Running VCF_creator with a given reference genome ()
sh [DISCO_DIR]/scripts/run_VCF_creator.sh -G dm6_masked.fa -p myDiscoSnpRADResult_raw_filtered.fa -e -o temp.vcf
# Adding clustering information (and minimal filtering on cluster size)
python add_cluster_info_to_mapped_vcf.py -m temp.vcf -u myDiscoSnpRADResult_clustered.vcf -o myDiscoSnpRADResult_mapped.vcf
# final vcf is myDiscoSnpRADResult_mapped.vcf
```
Additionnally, in a validation context, if one wants to compare variant positions between two such vcf files, the following command will output recall and precision metrics:
```
python [DISCO_DIR]/scripts/validation_scripts/compare_vcf_disco_pos_allele_only.py truth.vcf myDiscoSnpRADResult_mapped.vcf
```
<!---
== Clustering full process ==
-----------------------------
`ls ${discofile}.fa > ${discofile}.fof `
`short_read_connector.sh -b ${discofile}.fa -q ${discofile}.fof -s 0 -k ${k} -a
1 -l -p ${discofile}.txt`
`./quick_hierarchical_clustering ${discofile}.txt > ${discofile}.cluster`
`python3 clusters_and_fasta_to_fasta.py ${discofile}.fa ${discofile}.cluster >
${discofile}_with_clusters.fa`
`run_VCF_creator.sh -p ${discofile}_with_clusters.fa
-o ${discofile}_with_clusters.vcf`
**NB**: Then the `${discofile}_with_clusters.vcf` can be simply sorted to put
together variants from each cluster.
**Sources**:
`short_read_connector.sh` from <https://github.com/GATB/short_read_connector>
`run_VCF_creator.sh` from <https://github.com/GATB/DiscoSnp> (scripts directory)
== Old scripts ==
* `filter_missgeno.py" : filtering out variants with too many missing genotypes
* From a .fa file generated by discoSnp-Rad, remove variants with too many missing genotypes
* Usage:
`filter_missgeno.py file max_missing` (with max_missing in percent)
-->
|