1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170
|
#!/usr/bin/python3
import configparser
import sys
import getopt
import os
import time
helpMessage = """
Usage:
GenerateConfig.py
Note: The goal of this script is to generate a good starter configuration for the
Shasta long read assembler.
Each genome is different and reads differ in quality. So it is impossible to generate a perfect
configuration programmatically. If you'd like to get some suggestions on how to tweak the
configuration generated by this script, you should use the `GetFeedback.py` script after
you run your assembly.
If you're still not happy with the assembly after using these two scripts, please file a Github
issue and we can help.
"""
def usage():
print(helpMessage)
return
def generateConfig(
genomeSize,
baseCallerId,
enableDetangling,
areUltraLongReads,
):
minReadLength = 10000
if areUltraLongReads:
minReadLength = 40000
# The following Align parameters are set to very permissive values to allow the majority of alignments
# to be assessed during the initial stage of automatic alignment parameter selection
maxSkip = 100
maxDrift = 100
maxTrim = 100
minAlignedMarkerCount = 10
minAlignedFraction = 0.1
# This method uses the observed distribution of alignment stats to choose a cutoff for
# maxSkip, maxDrift, maxTrim, minAlignedMarkerCount, and minAlignedFraction
readGraphCreationMethod = 2
config = configparser.ConfigParser()
config.optionxform = lambda option: option
config['Reads'] = {
'minReadLength': str(minReadLength),
'desiredCoverage': str(genomeSize * 60), # 60X coverage
'noCache': 'True',
}
config['Kmers'] = {
'k': '10'
}
config['MinHash'] = {
'minHashIterationCount': '10',
'minBucketSize': '5',
'maxBucketSize': '30',
'minFrequency': '5',
}
config['Align'] = {
'alignMethod': '3',
'downsamplingFactor': '0.05',
'matchScore': '6',
'maxSkip': str(maxSkip),
'maxDrift': str(maxDrift),
'maxTrim': str(maxTrim),
'minAlignedFraction': str(minAlignedFraction),
'minAlignedMarkerCount': str(minAlignedMarkerCount),
'sameChannelReadAlignment.suppressDeltaThreshold': '30',
}
config['ReadGraph'] = {
'creationMethod': str(readGraphCreationMethod),
}
config['MarkerGraph'] = {
'simplifyMaxLength': '10,100,1000,10000,100000',
'refineThreshold': '6',
'crossEdgeCoverageThreshold': '3',
'minCoverage': '0', # Auto select
}
config['Assembly'] = {
'consensusCaller': 'Bayesian:guppy-3.6.0-a'
}
if baseCallerId == 2 or baseCallerId == 3:
# Guppy version less than 3.6.0
config['Reads']['desiredCoverage'] = str(genomeSize * 80) # 80X coverage
if baseCallerId == 2:
config['Assembly']['consensusCaller'] = 'Bayesian:guppy-3.0.5-a'
else:
config['Assembly']['consensusCaller'] = 'Modal'
if enableDetangling:
config['Assembly']['detangleMethod'] = '2'
return config
def main(argv):
genomeSizeMessage = """
What is the approximate genome size in megabases (Mbp)?
Examples:
3000 (for a 3 Gbp genome)
0.4 (for a 400 Kbp genome)
"""
print(genomeSizeMessage)
genomeSize = float(input('Approximate genome size in megabasis (Mbp): '))
techMessage = """
What technology was used to generate the reads?
Enter 1 for Oxford Nanopore (ONT) (default)
Enter 2 for Pac Bio
"""
print(techMessage)
techStr = input('Technology : ')
if techStr == '':
techId = 1
else:
techId = int(techStr)
baseCallerMessage = """
Which basecaller was used to generate reads?
Enter 1 for Guppy version >= 3.6.0 (default)
Enter 2 for Guppy version < 3.6.0
Enter 3 if something other than Guppy was used
"""
if techId == 1: # Only applies to ONT reads.
print(baseCallerMessage)
baseCallerIdStr = input('Basecaller : ')
if baseCallerIdStr == '':
baseCallerId = 1
else:
baseCallerId = int(baseCallerIdStr)
else:
baseCallerId = 3
print()
areUltraLongReads = input('Using Ultra-long reads? [Y/n - default n] : ')
areUltraLongReads = (areUltraLongReads == 'Y' or areUltraLongReads == 'y')
print()
enableDetangling = input('Enable detangling? [Y/n - default Y] : ')
enableDetangling = (enableDetangling == 'Y' or enableDetangling == 'y' or enableDetangling == '')
config = generateConfig(
int(genomeSize * 1000 * 1000),
baseCallerId,
enableDetangling,
areUltraLongReads,
)
configFileName = 'generatedShasta.conf'
with open(configFileName, 'w') as configfile:
config.write(configfile)
print("Shasta configuration written out to {}".format(configFileName))
return
if __name__ == '__main__':
main(sys.argv[1:])
|