Utility functions for processing sequences and subsequences

fasta file utilities

KmerGMA.fasta_id_to_cumulative_len_dictFunction
fasta_id_to_cumulative_len_dict(fasta_file_path::String)

function to record and store cumulative lengths of the BEGINNING of each record in a dictionary, from a fasta file. for example: for the following fasta file

>firstseq
ATGC
>secondseq
AT

The dictionary returned would be:

"firstseq"  => 4
"secondseq" => 6

Not nessecarily in that order.

source
KmerGMA.getSeqFunction
getSeq(sequence::FASTA.Record)

Alias to get the dna longsequence of a fasta record

source

kmer utilities

KmerGMA.kmer_countFunction
kmer_count(str::DnaSeq, k::Int, Nt_bits::DnaBits = NUCLEOTIDE_BITS)

Simple kmer counting function for a DNA BioSequence str, where k is the kmer length to count. The function returns a kmer frequency vector where each INDEX of the vector corresponds to a unique kmer. For each index, the unsigned bits of the index correspond to a binary representation of sequences where two bits encode one nucleotide. To see what kmer each index corresponds to, see the function as_kmer

source
KmerGMA.as_kmerFunction
as_kmer(kmer_uint::Integer, kmer_len::Int,Nt_bits::Dict{UInt, Seq} = BitNtDict)

Takes a positive integer representing a 2-bit-based dna kmer, and the actual length of the kmer, and returns the BioSequences dna sequence of the kmer. The Nt_bits argument can be ignored. Additionally, note that both input parameters will be modified to 0 or 1.

source
KmerGMA.as_UIntFunction
UInt(kmer_seq::LongSequence{DNAAlphabet{4}}, Nt_bits::DnaBits = NUCLEOTIDE_BITS)

Convert a dna biosequence kmer_seq into an unsigned integer. Users can ignore the second argument

source
KmerGMA.kmer_distFunction
kmer_dist(seq1, seq2, k::Int, Nt_bits::DnaBits = NUCLEOTIDE_BITS)

returns the kmer distance between two biosequences seq1 and seq2, where k is the kmer length. Users can ignore the last argument.

source

paired kmer utilities

The following function count all kmer pairs within a sequence, which is a feature to be further utilized in an upcoming release. Counting paired kmers could serve as a more robust alternative to regular kmers in many applications.

KmerGMA.kmer_pair_countFunction
kmer_pair_count(seq::DnaSeq, k::Int = 3, Nt_bits::DnaBits = NUCLEOTIDE_BITS)

Counts all paired kmers in a Biosequences DNA longsequence (or a longsubseq), where each index of the resulting paired kmer frequency vector correspond to the first and second kmers appended together.

For example, the 2mer pair AT and GC in a sequence ATGC would have a count of 1 in the resulting vector of length 4^(2*2) = 1 It would be at the index in the vector equivalent to as_UInt(dna"ATGC")

source

strobemer utilities (experimental)

Strobemers consist of two or more linked shorter k-mers, where the combination of linked k-mers is decided by a hash function. This implementation is currently being experimented with and is unoptimized, but may be implemented in the near future as a new homology searching algorithm.

The strobemers implemented are simply randstrobes with two k-mers, and in the future will be a part of a new module StrobemerGMA

KmerGMA.get_strobe_2_merFunction
get_strobe_2_mer(
    seq::LongSequence{DNAAlphabet{4}},
    s::Int = 2,
    w_min::Int = 3,
    w_max::Int = 5,
    q::Int = 5;
    withGap::Bool = true)

Get the randstrobe that consists of two kmers of the current sequence seq. The withGap named argument indicates whether to return the strobemer with sequence gaps or just as a regular kmer with all gaps removed

...

Strobemer parameters

  • s::Int = 2: length of the kmers that consists of the strobemer.
  • w_min::Int = 3: the start of the window to obtain the second kmer from.
  • w_max::Int = 5: the end of the window to obtain the second kmer from.
  • q::Int = 5: the prime number that the randstrobe hashing function should use.

...

It is recommended to keep all optional strobemer parameters as is

source
KmerGMA.ungapped_strobe_2_mer_countFunction
ungapped_strobe_2_mer_count(
    seq::DnaSeq;
    s::Int = 2,
    w_min::Int = 3,
    w_max::Int = 5,
    q::Int = 5)

Get the ungapped randstrobe (two kmers of the current sequence seq) frequency vector. The vector is $4^{2s}$ long, where each index corresponds to the strobemer with gaps removed. So the strobemer AC--GT-- would correspond to the index corresponding to ACGT. To convert from index to the ungapped strobemer, see KmerGMA.as_kmer

...

Strobemer parameters

  • s::Int = 2: length of the kmers that consists of the strobemer.
  • w_min::Int = 3: the start of the window to obtain the second kmer from.
  • w_max::Int = 5: the end of the window to obtain the second kmer from.
  • q::Int = 5: the prime number that the randstrobe hashing function should use.

...

It is currently the responsibility of the user to ensure that all parameters do not conflict with eachother

source

sequence divergence (work in progress)

KmerGMA.mutate_seqFunction
mutate_seq(seq::LongSequence{DNAAlphabet{4}}, mut_rate::Real)

returns a new biosequence where mute_rate portion of the input biosequence seq are substituted to a different base pair. The mut_rate should be between 0 and 1.

source
KmerGMA.mutate_seq!Function
mutate_seq!(seq::LongSequence{DNAAlphabet{4}}, mut_rate::Real)

randomly subsitutes mute_rate portion of the input biosequence seq in place.

source