Some notes on the Python API
============================
Some programming guidelines
---------------------------
...
Valid and invalid DNA
~~~~~~~~~~~~~~~~~~~~~
Generally speaking, the Python API for khmer and oxli assumes that
it is receiving valid DNA (ACGT). Low-level hash functions like
``hash(kmer)`` and mid-level hash functions like ``hash_kmer_hashes(str)``
neither check for valid DNA nor specify their output for invalid DNA
characters.
However, bulk loading functions provide a ``cleaned_seq`` attribute that
will ... document me here.
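Because the low-level functions do not validate their input, callers may want to check sequences themselves before hashing. The following is a minimal sketch of such a check in plain Python. ``is_valid_dna`` is a hypothetical helper, not part of khmer or oxli, and it assumes that only uppercase A, C, G, and T count as valid.

```python
# Hypothetical helper for illustration; khmer and oxli do not provide it.
VALID_DNA = frozenset("ACGT")


def is_valid_dna(seq):
    """Return True if seq is non-empty and contains only A, C, G, or T."""
    return bool(seq) and set(seq) <= VALID_DNA


# Filter reads before handing them to functions that assume valid DNA.
reads = ["ACGTTGCA", "ACGNNGCA", "acgt"]
clean_reads = [read for read in reads if is_valid_dna(read)]
```

Whether lowercase bases or ambiguity codes such as N should be rejected, uppercased, or replaced is a policy choice that this sketch leaves to the caller.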
Table types
-----------
Type names consist of two parts. The first part indicates how far the type
can count and the second part whether it is a table or a graph.
Possible choices for the first part:
* Node, uses a 1-bit counter
* SmallCount, uses a 4-bit counter
* Count, uses an 8-bit counter
Possible choices for the second part:
* Table, keeps track of k-mers
* Graph, supports navigating, tagging, etc. of the de Bruijn graph formed by the k-mers
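The naming scheme above can be sketched as a small lookup that combines each counter prefix with each structure suffix and derives the largest storable count from the counter width. This is an illustration only: the dictionary and function names are hypothetical, and the exact capitalisation of the combined type names is an assumption.

```python
from itertools import product

# Counter widths in bits, taken from the first list above.
COUNTER_BITS = {"Node": 1, "SmallCount": 4, "Count": 8}
# Suffixes from the second list (lowercased when combined; an assumption).
STRUCTURES = ("table", "graph")


def max_count(prefix):
    """Largest value an unsigned counter of the given width can hold."""
    return (1 << COUNTER_BITS[prefix]) - 1


# Map every combined type name to the count at which it saturates.
TYPE_LIMITS = {
    prefix + structure: max_count(prefix)
    for prefix, structure in product(COUNTER_BITS, STRUCTURES)
}
```

Under these assumptions a Node type can only record presence or absence, while SmallCount and Count types saturate at 15 and 255 respectively.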
C++ class name:
Python methods:
* k = ksize() - return the k-mer size (a positive integer).
* hashval = hash(dna_kmer) - return the result of hashing ``dna_kmer``, which will be a non-negative integer. ``len(dna_kmer)`` must be exactly the k-mer size. The hash function used depends on the table type (@document).
* dna_kmer = reverse_hash(hashval) - return a DNA string that will hash to ``hashval``. If there are multiple such strings, return only one. May be unimplemented for particular table types, in which case a ValueError will be raised.
* sizelist = hashsizes() - return the list of table sizes used in construction.
* n = n_unique_kmers() - retrieve an estimate of the number of unique k-mers inserted into the table. Note that this may be order-dependent.
* n = n_occupied() - retrieve the number of bins occupied in the table.
* add(dna_kmer_or_hashval) - increment the count associated with either a DNA k-mer or a hashval. Depending on the max count for the table type and the bigcount setting, the count may top out at 1, 15, 255, or 65535. (@CTB add method for retrieving max_count)
* count (synonym for add)
* get(dna_kmer or hashval) - retrieve the count associated with a DNA k-mer or a hashval.
* list_of_strings = get_kmers(seq) - return the list of k-mer strings in the given sequence.
* list_of_hashes = get_kmer_hashes(seq) - return the list of the hashed k-mers in the given sequence.
* hashset = get_kmer_hashes_as_hashset(seq) - return the hashset of hashed k-mers in the given sequence.
* list_of_counts = get_kmer_counts(seq) - return the list of the counts of the k-mers in the given sequence.
* save(filename) - save the data to a file on disk.
* load(filename) - load the data from a file on disk.
* min_count = get_min_count(seq) - return the minimum count for k-mers in the sequence.
* med_count, avg_count, stddev_count = get_median_count(seq) - return the median, average, and stddev of the counts for k-mers in the sequence.
* max_count = get_max_count(seq) - return the maximum count for k-mers in the sequence.
* num_kmers = consume(seq) - count all the k-mers in the given DNA string. @CTB should be consume string
* consume_fasta(filename) - count all the k-mers in a (DNA) FASTA/FASTQ file.
* consume_fasta_with_reads_parser(khmer.ReadParser object) - count all the k-mers in (DNA) sequences returned from a ReadParser object. This can be used to ensure various forms of pairing are present, etc.
* (trim_seq, trim_pos) = trim_on_abundance(seq, abund) - trim the sequence at the first k-mer with an abundance strictly below (<) the provided abundance.
* (trim_seq, trim_pos) = trim_below_abundance(seq, abund) - trim the sequence at the first k-mer with an abundance strictly above (>) the provided abundance.
* list_of_posns = find_spectral_error_positions(seq, abund) - find potential locations of errors in input sequence, where k-mers are below given abundance.
* set_use_bigcount(bool) - some table types (Counttable and Countgraph) support counting past their max value, using a (memory-intensive) C++ ``std::map``. This method enables or disables this "bigcount" behavior. A ValueError is raised when it is called on a table type that does not support it.
* get_use_bigcount() - return the bigcount setting (False by default).
* dist = abundance_distribution(filename, tracking_obj) - generate an abundance distribution for the k-mers in the given file, using the tracking_obj to avoid double-counting identical k-mers.
* dist = abundance_distribution_with_reads_parser(readparser_obj, tracking_obj) - generate an abundance distribution for the k-mers loaded from the given readparser, using the tracking_obj to avoid double-counting identical k-mers.
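To make the semantics of a few of the methods above concrete, here is a toy model in plain Python. It is a sketch under stated assumptions, not khmer's implementation: ``ToyCountTable`` is a hypothetical class that stores exact counts in a dictionary (so there are no hash collisions), ignores reverse complements, uses the population standard deviation, and assumes that trimming keeps every base covered by the k-mers preceding the first low-abundance one.

```python
from collections import Counter
from statistics import mean, median, pstdev


class ToyCountTable:
    """Exact, dict-backed stand-in for a count table (illustration only)."""

    def __init__(self, ksize, max_count=255):
        self._ksize = ksize
        self._max_count = max_count      # 255 mimics an 8-bit counter
        self._counts = Counter()

    def ksize(self):
        return self._ksize

    def get_kmers(self, seq):
        k = self._ksize
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    def add(self, kmer):
        if len(kmer) != self._ksize:
            raise ValueError("k-mer length must equal ksize()")
        # Saturate instead of overflowing, like a fixed-width counter.
        self._counts[kmer] = min(self._counts[kmer] + 1, self._max_count)

    def get(self, kmer):
        return self._counts[kmer]        # Counter yields 0 for unseen k-mers

    def consume(self, seq):
        kmers = self.get_kmers(seq)
        for kmer in kmers:
            self.add(kmer)
        return len(kmers)

    def get_kmer_counts(self, seq):
        return [self.get(kmer) for kmer in self.get_kmers(seq)]

    def get_median_count(self, seq):
        # Assumes seq contains at least one k-mer.
        counts = self.get_kmer_counts(seq)
        return median(counts), mean(counts), pstdev(counts)

    def trim_on_abundance(self, seq, abund):
        for i, count in enumerate(self.get_kmer_counts(seq)):
            if count < abund:
                # Keep every base covered by the k-mers before the low one.
                pos = i + self._ksize - 1 if i > 0 else 0
                return seq[:pos], pos
        return seq, len(seq)


table = ToyCountTable(ksize=4)
n_consumed = table.consume("ACGTACGTAC")
trimmed, position = table.trim_on_abundance("ACGTACGTAC", 2)
```

In this example the sequence contains seven 4-mers, of which only ``TACG`` occurs once, so trimming at abundance 2 cuts the sequence just before the base that completes ``TACG``. A real table is probabilistic and may report higher counts than this exact model.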
Graph types
-----------
All the methods of table types, and in addition:
* neighbors
* calc_connected_graph_size
* kmer_degree
* count_kmers_within_radius
* find_high_degree_nodes
* traverse_linear_path
* assemble_linear_path
* consume_and_tag
* get_tags_and_positions
* find_all_tags_list
* consume_fasta_and_tag
* extract_unique_paths
* print_tagset
* add_tag
* get_tagset
* load_tagset
* save_tagset
* n_tags
* divide_tags_into_subsets
* _get_tag_density
* _set_tag_density
* do_subset_partition
* find_all_tags
* assign_partition_id
* output_partitions
* load_partitionmap
* save_partitionmap
* _validate_partitionmap
* consume_fasta_and_tag_with_reads_parser
* consume_partitioned_fasta
* merge_subset
* merge_subset_from_disk
* count_partitions
* subset_count_partitions
* subset_partition_size_distribution
* save_subset_partitionmap
* load_subset_partitionmap
* _validate_subset_partitionmap
* set_partition_id
* join_partitions
* get_partition_id
* repartition_latest_partition
* load_stop_tags
* save_stop_tags
* print_stop_tags
* trim_on_stoptags
* add_stop_tags
* get_stop_tags
SmallCountgraph:
* get_raw_tables
Countgraph:
* get_raw_tables
* do_subset_partition_with_abundance
Nodegraph:
* update
* get_raw_tables
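Several of the graph methods listed above, such as ``neighbors`` and ``kmer_degree``, operate on the de Bruijn graph in which two k-mers are adjacent when they overlap by k-1 bases. The sketch below illustrates that adjacency rule in plain Python. The functions are hypothetical stand-ins that take an explicit set of k-mers; they ignore reverse complements and do not reproduce the signatures or return types of the khmer methods of the same name.

```python
def neighbors(kmer, present):
    """Return the k-mers in `present` that overlap `kmer` by k-1 bases."""
    found = []
    for base in "ACGT":
        successor = kmer[1:] + base      # drop the first base, append one
        predecessor = base + kmer[:-1]   # drop the last base, prepend one
        for candidate in (successor, predecessor):
            if candidate in present and candidate not in found:
                found.append(candidate)
    return found


def kmer_degree(kmer, present):
    """Number of distinct neighbors of `kmer` within `present`."""
    return len(neighbors(kmer, present))


# The 3-mers of ACGTT form a simple linear path: ACG -> CGT -> GTT.
kmers = {"ACG", "CGT", "GTT"}
middle_neighbors = neighbors("CGT", kmers)
end_degree = kmer_degree("ACG", kmers)
```

Here the middle k-mer has two neighbors and the end k-mer has one, which is the kind of degree pattern a linear path produces. Note that in this simplified model a homopolymer such as ``AAA`` counts as its own neighbor.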