1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
|
#!/usr/bin/env perl
use strict;
use warnings;
use lib ("$ENV{TRINITY_HOME}/PerlLib");
use Gene_obj;
use Fasta_reader;
use GFF3_utils;
use Carp;
use Nuc_translator;
my $MIN_LENGTH = 1000;
my $MIN_ISO_COUNT = 2;
my $usage = "\n\nusage: $0 gff3_file genome_db [flank]\n\n";
my $gff3_file = $ARGV[0] or die $usage;
my $fasta_db = $ARGV[1] or die $usage;
my $flank = $ARGV[2] || 0;
my $fasta_reader = new Fasta_reader($fasta_db);
my %genome = $fasta_reader->retrieve_all_seqs_hash();
my $gene_obj_indexer_href = {};
## associate gene identifiers with contig id's.
my $contig_to_gene_list_href = &GFF3_utils::index_GFF3_gene_objs($gff3_file, $gene_obj_indexer_href);
my $outdir = "gene_contigs";
mkdir $outdir or die "Error, $outdir already exists";
my $gene_contigs_per_bin = 100;
my $gene_counter = 0;
foreach my $asmbl_id (sort keys %$contig_to_gene_list_href) {
my $genome_seq = $genome{$asmbl_id} or die "Error, cannot find sequence for $asmbl_id"; #cdbyank_linear($asmbl_id, $fasta_db);
my @gene_ids = @{$contig_to_gene_list_href->{$asmbl_id}};
foreach my $gene_id (@gene_ids) {
my $gene_id_for_filename = $gene_id;
$gene_id_for_filename =~ s/\W/_/g;
my $gene_obj_ref = $gene_obj_indexer_href->{$gene_id};
my ($gene_contig_lend, $gene_contig_rend) = sort {$a<=>$b} $gene_obj_ref->get_gene_span();
$gene_contig_lend -= $flank;
$gene_contig_rend += $flank;
$gene_obj_ref->adjust_gene_coordinates(-1 * ($gene_contig_lend-1));
# update gene identifiers.
my @iso_objs;
foreach my $iso_obj ($gene_obj_ref, $gene_obj_ref->get_additional_isoforms()) {
$iso_obj->{asmbl_id} = $gene_id;
my $cdna_len = $iso_obj->get_cDNA_length();
if ($cdna_len >= $MIN_LENGTH) {
push (@iso_objs, $iso_obj);
}
}
if (scalar @iso_objs >= $MIN_ISO_COUNT) {
$gene_obj_ref->delete_isoforms();
my $gene_obj = shift @iso_objs;
$gene_obj->add_isoform(@iso_objs);
$gene_counter++;
my $bindir = "$outdir/bin_" . int($gene_counter/$gene_contigs_per_bin) . "/$gene_id_for_filename";
print STDERR "[$gene_counter] -processing $bindir\n";
&process_cmd("mkdir -p $bindir");
my $contig_filename = "$bindir/gene.fa";
open (my $ofh, ">$contig_filename") or die $!;
my $gene_seq = substr($genome_seq, $gene_contig_lend-1, $gene_contig_rend - $gene_contig_lend + 1);
print $ofh ">$gene_id\n$gene_seq\n";
close $ofh;
my $gff3_filename = "$bindir/gene.gff3";
open ($ofh, ">$gff3_filename") or die $!;
print $ofh $gene_obj->to_GFF3_format();
close $ofh;
}
if ($gene_counter > 10) { last; }
}
}
exit(0);
####
sub process_cmd {
my ($cmd) = @_;
my $ret = system($cmd);
if ($ret) {
die "Error, CMD: $cmd died with ret $ret";
}
return;
}
|