Utility functions for processing sequences and subsequences
fasta file utilities
KmerGMA.fasta_id_to_cumulative_len_dict — Functionfasta_id_to_cumulative_len_dict(fasta_file_path::String)function to record and store cumulative lengths of the BEGINNING of each record in a dictionary, from a fasta file. for example: for the following fasta file
>firstseq
ATGC
>secondseq
ATThe dictionary returned would be:
"firstseq" => 4
"secondseq" => 6Not nessecarily in that order.
KmerGMA.getSeq — FunctiongetSeq(sequence::FASTA.Record)Alias to get the dna longsequence of a fasta record
kmer utilities
KmerGMA.kmer_count — Functionkmer_count(str::DnaSeq, k::Int, Nt_bits::DnaBits = NUCLEOTIDE_BITS)Simple kmer counting function for a DNA BioSequence str, where k is the kmer length to count. The function returns a kmer frequency vector where each INDEX of the vector corresponds to a unique kmer. For each index, the unsigned bits of the index correspond to a binary representation of sequences where two bits encode one nucleotide. To see what kmer each index corresponds to, see the function as_kmer
KmerGMA.as_kmer — Functionas_kmer(kmer_uint::Integer, kmer_len::Int,Nt_bits::Dict{UInt, Seq} = BitNtDict)Takes a positive integer representing a 2-bit-based dna kmer, and the actual length of the kmer, and returns the BioSequences dna sequence of the kmer. The Nt_bits argument can be ignored. Additionally, note that both input parameters will be modified to 0 or 1.
KmerGMA.as_UInt — FunctionUInt(kmer_seq::LongSequence{DNAAlphabet{4}}, Nt_bits::DnaBits = NUCLEOTIDE_BITS)Convert a dna biosequence kmer_seq into an unsigned integer. Users can ignore the second argument
KmerGMA.kmer_dist — Functionkmer_dist(seq1, seq2, k::Int, Nt_bits::DnaBits = NUCLEOTIDE_BITS)returns the kmer distance between two biosequences seq1 and seq2, where k is the kmer length. Users can ignore the last argument.
paired kmer utilities
The following function count all kmer pairs within a sequence, which is a feature to be further utilized in an upcoming release. Counting paired kmers could serve as a more robust alternative to regular kmers in many applications.
KmerGMA.kmer_pair_count — Functionkmer_pair_count(seq::DnaSeq, k::Int = 3, Nt_bits::DnaBits = NUCLEOTIDE_BITS)Counts all paired kmers in a Biosequences DNA longsequence (or a longsubseq), where each index of the resulting paired kmer frequency vector correspond to the first and second kmers appended together.
For example, the 2mer pair AT and GC in a sequence ATGC would have a count of 1 in the resulting vector of length 4^(2*2) = 1 It would be at the index in the vector equivalent to as_UInt(dna"ATGC")
strobemer utilities (experimental)
Strobemers consist of two or more linked shorter k-mers, where the combination of linked k-mers is decided by a hash function. This implementation is currently being experimented with and is unoptimized, but may be implemented in the near future as a new homology searching algorithm.
The strobemers implemented are simply randstrobes with two k-mers, and in the future will be a part of a new module StrobemerGMA
KmerGMA.get_strobe_2_mer — Functionget_strobe_2_mer(
seq::LongSequence{DNAAlphabet{4}},
s::Int = 2,
w_min::Int = 3,
w_max::Int = 5,
q::Int = 5;
withGap::Bool = true)Get the randstrobe that consists of two kmers of the current sequence seq. The withGap named argument indicates whether to return the strobemer with sequence gaps or just as a regular kmer with all gaps removed
...
Strobemer parameters
s::Int = 2: length of the kmers that consists of the strobemer.w_min::Int = 3: the start of the window to obtain the second kmer from.w_max::Int = 5: the end of the window to obtain the second kmer from.q::Int = 5: the prime number that the randstrobe hashing function should use.
...
It is recommended to keep all optional strobemer parameters as is
KmerGMA.ungapped_strobe_2_mer_count — Functionungapped_strobe_2_mer_count(
seq::DnaSeq;
s::Int = 2,
w_min::Int = 3,
w_max::Int = 5,
q::Int = 5)Get the ungapped randstrobe (two kmers of the current sequence seq) frequency vector. The vector is $4^{2s}$ long, where each index corresponds to the strobemer with gaps removed. So the strobemer AC--GT-- would correspond to the index corresponding to ACGT. To convert from index to the ungapped strobemer, see KmerGMA.as_kmer
...
Strobemer parameters
s::Int = 2: length of the kmers that consists of the strobemer.w_min::Int = 3: the start of the window to obtain the second kmer from.w_max::Int = 5: the end of the window to obtain the second kmer from.q::Int = 5: the prime number that the randstrobe hashing function should use.
...
It is currently the responsibility of the user to ensure that all parameters do not conflict with eachother
sequence divergence (work in progress)
KmerGMA.mutate_seq — Functionmutate_seq(seq::LongSequence{DNAAlphabet{4}}, mut_rate::Real)returns a new biosequence where mute_rate portion of the input biosequence seq are substituted to a different base pair. The mut_rate should be between 0 and 1.
KmerGMA.mutate_seq! — Functionmutate_seq!(seq::LongSequence{DNAAlphabet{4}}, mut_rate::Real)randomly subsitutes mute_rate portion of the input biosequence seq in place.