Utility functions for processing sequences and subsequences
fasta file utilities
KmerGMA.fasta_id_to_cumulative_len_dict
— Functionfasta_id_to_cumulative_len_dict(fasta_file_path::String)
function to record and store cumulative lengths of the BEGINNING of each record in a dictionary, from a fasta file. for example: for the following fasta file
>firstseq
ATGC
>secondseq
AT
The dictionary returned would be:
"firstseq" => 4
"secondseq" => 6
Not nessecarily in that order.
KmerGMA.getSeq
— FunctiongetSeq(sequence::FASTA.Record)
Alias to get the dna longsequence of a fasta record
kmer utilities
KmerGMA.kmer_count
— Functionkmer_count(str::DnaSeq, k::Int, Nt_bits::DnaBits = NUCLEOTIDE_BITS)
Simple kmer counting function for a DNA BioSequence str
, where k
is the kmer length to count. The function returns a kmer frequency vector where each INDEX of the vector corresponds to a unique kmer. For each index, the unsigned bits of the index correspond to a binary representation of sequences where two bits encode one nucleotide. To see what kmer each index corresponds to, see the function as_kmer
KmerGMA.as_kmer
— Functionas_kmer(kmer_uint::Integer, kmer_len::Int,Nt_bits::Dict{UInt, Seq} = BitNtDict)
Takes a positive integer representing a 2-bit-based dna kmer, and the actual length of the kmer, and returns the BioSequences dna sequence of the kmer. The Nt_bits argument can be ignored. Additionally, note that both input parameters will be modified to 0 or 1.
KmerGMA.as_UInt
— FunctionUInt(kmer_seq::LongSequence{DNAAlphabet{4}}, Nt_bits::DnaBits = NUCLEOTIDE_BITS)
Convert a dna biosequence kmer_seq
into an unsigned integer. Users can ignore the second argument
KmerGMA.kmer_dist
— Functionkmer_dist(seq1, seq2, k::Int, Nt_bits::DnaBits = NUCLEOTIDE_BITS)
returns the kmer distance between two biosequences seq1
and seq2
, where k
is the kmer length. Users can ignore the last argument.
paired kmer utilities
The following function count all kmer pairs within a sequence, which is a feature to be further utilized in an upcoming release. Counting paired kmers could serve as a more robust alternative to regular kmers in many applications.
KmerGMA.kmer_pair_count
— Functionkmer_pair_count(seq::DnaSeq, k::Int = 3, Nt_bits::DnaBits = NUCLEOTIDE_BITS)
Counts all paired kmers in a Biosequences DNA longsequence (or a longsubseq), where each index of the resulting paired kmer frequency vector
correspond to the first and second kmers appended together.
For example, the 2mer pair AT
and GC
in a sequence ATGC
would have a count of 1 in the resulting vector of length 4^(2*2) = 1 It would be at the index in the vector equivalent to as_UInt(dna"ATGC")
strobemer utilities (experimental)
Strobemers consist of two or more linked shorter k-mers, where the combination of linked k-mers is decided by a hash function. This implementation is currently being experimented with and is unoptimized, but may be implemented in the near future as a new homology searching algorithm.
The strobemers implemented are simply randstrobes with two k-mers, and in the future will be a part of a new module StrobemerGMA
KmerGMA.get_strobe_2_mer
— Functionget_strobe_2_mer(
seq::LongSequence{DNAAlphabet{4}},
s::Int = 2,
w_min::Int = 3,
w_max::Int = 5,
q::Int = 5;
withGap::Bool = true)
Get the randstrobe that consists of two kmers of the current sequence seq
. The withGap
named argument indicates whether to return the strobemer with sequence gaps or just as a regular kmer with all gaps removed
...
Strobemer parameters
s::Int = 2
: length of the kmers that consists of the strobemer.w_min::Int = 3
: the start of the window to obtain the second kmer from.w_max::Int = 5
: the end of the window to obtain the second kmer from.q::Int = 5
: the prime number that the randstrobe hashing function should use.
...
It is recommended to keep all optional strobemer parameters as is
KmerGMA.ungapped_strobe_2_mer_count
— Functionungapped_strobe_2_mer_count(
seq::DnaSeq;
s::Int = 2,
w_min::Int = 3,
w_max::Int = 5,
q::Int = 5)
Get the ungapped randstrobe (two kmers of the current sequence seq
) frequency vector. The vector is $4^{2s}$ long, where each index corresponds to the strobemer with gaps removed. So the strobemer AC--GT--
would correspond to the index corresponding to ACGT
. To convert from index to the ungapped strobemer, see KmerGMA.as_kmer
...
Strobemer parameters
s::Int = 2
: length of the kmers that consists of the strobemer.w_min::Int = 3
: the start of the window to obtain the second kmer from.w_max::Int = 5
: the end of the window to obtain the second kmer from.q::Int = 5
: the prime number that the randstrobe hashing function should use.
...
It is currently the responsibility of the user to ensure that all parameters do not conflict with eachother
sequence divergence (work in progress)
KmerGMA.mutate_seq
— Functionmutate_seq(seq::LongSequence{DNAAlphabet{4}}, mut_rate::Real)
returns a new biosequence where mute_rate
portion of the input biosequence seq
are substituted to a different base pair. The mut_rate
should be between 0 and 1.
KmerGMA.mutate_seq!
— Functionmutate_seq!(seq::LongSequence{DNAAlphabet{4}}, mut_rate::Real)
randomly subsitutes mute_rate
portion of the input biosequence seq
in place.