1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
|
#!/bin/bash
usage(){
echo "
Written by Brian Bushnell
Last modified March 19, 2025
Description: Estimates the strandedness of a library without alignment;
intended for RNA-seq data. Only the reads are required input to determine
strandedness, so this works when run with just a fastq file. If sam/bam
input is used, additional alignment-based metrics will be reported.
If an assembly and gff file are provided, the affinity of reads to the plus
or minus (sense or antisense) strand will also be calculated. If a genome
is specified with no gff file, the genes will be called automatically with a
prokaryotic gene caller. Strandedness and P/(P+M) ratios are similar but
calculated in different ways, and thus will not exactly agree, but should be
close. For most calculations, only read 1 is used, or the merge of read 1
and read 2 if the merge flag is enabled and they overlap.
Output meaning:
Depth_Analysis: Based on comparing the fraction of kmers in forward and
reverse orientation to a binomial distribution.
Strandedness: Percent of reads that came from the majority strand, based
on kmer depth.
StrandednessN: Depth-normalized strandedness, where each unique kmer
contributes equally regardless of depth.
AvgKmerDepth: Average depth of kmers; typically higher than read depth.
Kmers>Depth1: Fraction of kmers with depth over 1. Singleton kmers cannot
be used to calculate strandedness from depth.
Stop_Codon_Analysis: Based on counting stop codons.
MajorStrandORF: Predicted major strand based on stop-codon analysis.
AvgReadLen: Average length of R1, or the merged read pair.
AvgORFLen+: Average ORF length in the best frame on the plus side of R1.
AvgORFLen-: Average ORF length in the best frame on the minus side of R1.
AvgStopCount+: Average number of stop codons in the best frame on plus side.
AvgStopCount-: Average number of stop codons in the best frame on minus side.
GC_Content: GC content of reads, which affects stop codon frequency.
PolyA_Analysis: Based on counting poly-A or poly-T tails.
MajorStrandPA: Predicted major strand based on poly-A tail analysis.
This is unreliable if the poly-A fraction is low.
PolyA/(PA+PT): Ratio of (reads with poly-A tails)/(poly-A + poly-T tails).
PolyAFraction: Fraction of reads ending in poly-A or poly-T.
Ref_Analysis: Compares read kmer frequencies to a reference (if present).
The reference can be a transcriptome or genome, and a gff file
will be used if provided.
MajorStrandREF: The strand containing a majority of forward read kmers.
P/(P+M)_Ratio: P is the sum of counts of plus kmers, and M is minus kmers.
GeneCoverage: Fraction of transcriptome kmers represented in reads.
GenePrecision: Fraction of read kmers found in the transcriptome.
Read_Gene_Calling_Analysis: Uses gene-calling on reads to calculate which
strand better fits a prokaryotic gene model.
MajorStrandRGC: Predicted major strand based on read gene calling.
P/(P+M)_Ratio: P is the read count best matching the plus strand; M is minus.
AvgScorePlus: Average score of called plus-strand genes.
AvgScoreMinus: Average score of called plus-strand genes.
UsedFraction: Fraction of reads with any called genes (or partial genes);
this can be increased by merging the reads for longer frames.
Alignment_Results: Requires sam/bam input. The reads must have been
mapped to a transcriptome or RNA-seq assembly, or to a
specified genome, or a gff file must be provided.
StrandednessAL: Percent of reads aligned to the dominant strand. More
accurate for transcriptome-mapped than genome-mapped reads.
StrandednessAN: Depth-normalized strandedness, where each feature or
contig contributes equally.
MajorStrandAL: Strand to which a majority of reads aligned.
P/(P+M)_Ratio: P is the number of plus-mapped reads, M is minus.
P/(P+M)_RatioN: Depth-normalized plus/total ratio.
PlusFeatures: Fraction of features with majority plus-mapped reads.
AlignmentRate: Fraction of reads that aligned.
Feature-Mapped: Fraction of reads that aligned to a feature in the gff.
Usage: checkstrand.sh in=<input file>
Running on a fastq is simple, but there are multiple ways to run CheckStrand
on aligned data (in=, ref=, and gff= flags are not needed if the files have
proper extensions):
#1) This won't give alignment results, just kmer results
checkstrand.sh mapped.sam
#2) This will do gene-calling and the alignment strandedness will be based
on gene sense strand, but only works for prokaryotes/viruses
checkstrand.sh mapped.sam contigs.fa
#3) This will use the annotation and the alignment strandedness will be based
on gene sense strand, works for proks, and should work for eukaryotes
(there are lots of ways to annotate multi-exon genes though)
checkstrand.sh mapped.sam genes.gff
#4) This will assume that the reference was a sense-strand transcriptome,
and the alignment strandedness will be based on contig plus strand
checkstrand.sh mapped.sam transcriptome
#5) This will assume that the reference was unstranded contigs assembled
from RNA-seq data, so the alignment strandedness will be based on
contig majority strand
checkstrand.sh mapped.sam rnacontigs
Standard parameters:
in=<file> Primary input (a fastq, fasta, sam or bam file).
in2=<file> Secondary input for paired fastq in twin files. Read 2 is
ignored unless merge=t.
out=<file> Optional destination to redirect results instead of stdout.
outp=<file> Optional output for plus-mapped reads.
outm=<file> Optional output for minus-mapped reads.
*Note: outp/outm require sam/bam input and either transcriptome mode or a gff.
The destination of plus-mapped r1 would be outp, but outm for plus-mapped r2.
Processing parameters:
ref=<file> Optional reference (assembly) input.
gff=<file> Optional gene annotation file.
transcriptome=f Set this to 't' if the reference is a sense-strand
transcriptome (rather than a genome assembly). This applies
to either a reference specified by 'ref' or the reference
used for alignment, fo sam/bam input.
rnacontigs=f Set this to 't' if the reference is contigs assembled from
RNA-seq data, but with unknown orientation. Only affects
alignment results.
size=80000 Sketch size; larger may be more precise.
merge=f Attempt to merge paired reads, and use the merged read when
successful. If unsuccessful only R1 is used. This has a
performance impact on CPUs with few cores.
orf=t Analyze stop codons and open reading frames. Usually this
will allow major strand determination without a reference,
unless GC is very high.
callreads=t Perform gene-calling on reads using a prokarotic gene model.
Not very relevant to eukaryotes.
passes=2 Two passes refines the gene model for better gene-calling on
reads. Only used if there is a reference.
samplerate=1.0 Set to a lower number to subsample the input reads; increases
speed on CPUs with few cores.
sampleseed=17 Positive numbers are deterministic; negative use random seeds.
reads=-1 If positive, quit after processing this many reads.
Java Parameters:
-Xmx This will set Java's memory usage, overriding autodetection.
-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will
specify 200 megs. The max is typically 85% of physical memory.
-eoom This flag will cause the process to exit if an out-of-memory
exception occurs. Requires Java 8u92+.
-da Disable assertions.
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
"
}
#This block allows symlinked shellscripts to correctly set classpath.
pushd . > /dev/null
DIR="${BASH_SOURCE[0]}"
while [ -h "$DIR" ]; do
cd "$(dirname "$DIR")"
DIR="$(readlink "$(basename "$DIR")")"
done
cd "$(dirname "$DIR")"
DIR="$(pwd)/"
popd > /dev/null
#DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/"
CP="$DIR""current/"
z="-Xmx2g"
#z2="-Xms2g"
set=0
if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then
usage
exit
fi
calcXmx () {
source "$DIR""/calcmem.sh"
setEnvironment
parseXmx "$@"
}
calcXmx "$@"
checkstrand() {
local CMD="java $EA $EOOM $z -cp $CP jgi.CheckStrand2 $@"
echo $CMD >&2
eval $CMD
}
checkstrand "$@"
|