1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110
|
#!/usr/bin/env perl
use strict;
use warnings;
my $usage = "usage: $0 blast.grouped\n\n";
my $blast_out = $ARGV[0] or die $usage;
main: {
my $counter = 0;
my %query_to_top_hit; # only storing the hit with the greatest blast score.
# outfmt6:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
open (my $fh, $blast_out) or die "Error, cannot open file $blast_out";
while (<$fh>) {
if (/^\#/) { next; }
chomp;
my $line = $_;
my @x = split(/\t/);
my $query_id = $x[0];
my $db_id = $x[1];
my $percent_id = $x[2];
my $Evalue = $x[3];
my $pct_target_len = $x[7];
if ( (! exists $query_to_top_hit{$query_id}) || ($Evalue < $query_to_top_hit{$query_id}->{Evalue})
||
($Evalue == $query_to_top_hit{$query_id}->{Evalue}
&&
$pct_target_len > $query_to_top_hit{$query_id}->{pct_target_len} )
) {
$query_to_top_hit{$query_id} = { query_id => $query_id,
db_id => $db_id,
percent_id => $percent_id,
Evalue => $Evalue,
pct_target_len => $pct_target_len,
};
}
}
close $fh;
## get the best transcript hit per db ID
my %db_id_to_trans_hits;
foreach my $hit_struct (values %query_to_top_hit) {
my $db_id = $hit_struct->{db_id};
push (@{$db_id_to_trans_hits{$db_id}}, $hit_struct);
}
## histogram summary
my @bins = qw(10 20 30 40 50 60 70 80 90 100);
my %bin_counts;
foreach my $db_id (keys %db_id_to_trans_hits) {
my @hit_structs = @{$db_id_to_trans_hits{$db_id}};
@hit_structs = sort {$a->{Evalue} <=> $b->{Evalue}
||
$b->{pct_target_len} <=> $a->{pct_target_len} } @hit_structs;
my $entry = shift @hit_structs; # take the lowest E-value w/ longest pct_target_len
my $pct_cov = $entry->{pct_target_len};
my $prev_bin = 0;
foreach my $bin (@bins) {
if ($pct_cov > $prev_bin && $pct_cov <= $bin) {
$bin_counts{$bin}++;
}
$prev_bin = $bin;
}
}
## Report counts per bin
print "#hit_pct_cov_bin\tcount_in_bin\t>bin_below\n";
my $cumul = 0;
foreach my $bin (reverse(@bins)) {
my $count = $bin_counts{$bin} || 0;
$cumul += $count;
print join("\t", $bin, $count, $cumul) . "\n";
}
exit(0);
}
|