Utility functions for processing sequences and subsequences

fasta file utilities

KmerGMA.fasta_id_to_cumulative_len_dict — Function

fasta_id_to_cumulative_len_dict(fasta_file_path::String)

function to record and store cumulative lengths of the BEGINNING of each record in a dictionary, from a fasta file. for example: for the following fasta file

>firstseq
ATGC
>secondseq
AT

The dictionary returned would be:

"firstseq"  => 4
"secondseq" => 6

Not nessecarily in that order.

source

KmerGMA.getSeq — Function

getSeq(sequence::FASTA.Record)

Alias to get the dna longsequence of a fasta record

source

kmer utilities

KmerGMA.kmer_count — Function

kmer_count(str::DnaSeq, k::Int, Nt_bits::DnaBits = NUCLEOTIDE_BITS)

Simple kmer counting function for a DNA BioSequence str, where k is the kmer length to count. The function returns a kmer frequency vector where each INDEX of the vector corresponds to a unique kmer. For each index, the unsigned bits of the index correspond to a binary representation of sequences where two bits encode one nucleotide. To see what kmer each index corresponds to, see the function as_kmer

source

KmerGMA.as_kmer — Function

as_kmer(kmer_uint::Integer, kmer_len::Int,Nt_bits::Dict{UInt, Seq} = BitNtDict)

Takes a positive integer representing a 2-bit-based dna kmer, and the actual length of the kmer, and returns the BioSequences dna sequence of the kmer. The Nt_bits argument can be ignored. Additionally, note that both input parameters will be modified to 0 or 1.

source

KmerGMA.as_UInt — Function

UInt(kmer_seq::LongSequence{DNAAlphabet{4}}, Nt_bits::DnaBits = NUCLEOTIDE_BITS)

Convert a dna biosequence kmer_seq into an unsigned integer. Users can ignore the second argument

source

KmerGMA.kmer_dist — Function

kmer_dist(seq1, seq2, k::Int, Nt_bits::DnaBits = NUCLEOTIDE_BITS)

returns the kmer distance between two biosequences seq1 and seq2, where k is the kmer length. Users can ignore the last argument.

source

paired kmer utilities

The following function count all kmer pairs within a sequence, which is a feature to be further utilized in an upcoming release. Counting paired kmers could serve as a more robust alternative to regular kmers in many applications.

KmerGMA.kmer_pair_count — Function

kmer_pair_count(seq::DnaSeq, k::Int = 3, Nt_bits::DnaBits = NUCLEOTIDE_BITS)

Counts all paired kmers in a Biosequences DNA longsequence (or a longsubseq), where each index of the resulting paired kmer frequency vector correspond to the first and second kmers appended together.

For example, the 2mer pair AT and GC in a sequence ATGC would have a count of 1 in the resulting vector of length 4^(2*2) = 1 It would be at the index in the vector equivalent to as_UInt(dna"ATGC")

source

strobemer utilities (experimental)

Strobemers consist of two or more linked shorter k-mers, where the combination of linked k-mers is decided by a hash function. This implementation is currently being experimented with and is unoptimized, but may be implemented in the near future as a new homology searching algorithm.

The strobemers implemented are simply randstrobes with two k-mers, and in the future will be a part of a new module StrobemerGMA

KmerGMA.get_strobe_2_mer — Function

get_strobe_2_mer(
    seq::LongSequence{DNAAlphabet{4}},
    s::Int = 2,
    w_min::Int = 3,
    w_max::Int = 5,
    q::Int = 5;
    withGap::Bool = true)

Get the randstrobe that consists of two kmers of the current sequence seq. The withGap named argument indicates whether to return the strobemer with sequence gaps or just as a regular kmer with all gaps removed

...

Strobemer parameters

s::Int = 2: length of the kmers that consists of the strobemer.
w_min::Int = 3: the start of the window to obtain the second kmer from.
w_max::Int = 5: the end of the window to obtain the second kmer from.
q::Int = 5: the prime number that the randstrobe hashing function should use.

...

It is recommended to keep all optional strobemer parameters as is

source

KmerGMA.ungapped_strobe_2_mer_count — Function

ungapped_strobe_2_mer_count(
    seq::DnaSeq;
    s::Int = 2,
    w_min::Int = 3,
    w_max::Int = 5,
    q::Int = 5)

Get the ungapped randstrobe (two kmers of the current sequence seq) frequency vector. The vector is $4^{2s}$ long, where each index corresponds to the strobemer with gaps removed. So the strobemer AC--GT-- would correspond to the index corresponding to ACGT. To convert from index to the ungapped strobemer, see KmerGMA.as_kmer

...

Strobemer parameters

s::Int = 2: length of the kmers that consists of the strobemer.
w_min::Int = 3: the start of the window to obtain the second kmer from.
w_max::Int = 5: the end of the window to obtain the second kmer from.
q::Int = 5: the prime number that the randstrobe hashing function should use.

...

It is currently the responsibility of the user to ensure that all parameters do not conflict with eachother

source

sequence divergence (work in progress)

KmerGMA.mutate_seq — Function

mutate_seq(seq::LongSequence{DNAAlphabet{4}}, mut_rate::Real)

returns a new biosequence where mute_rate portion of the input biosequence seq are substituted to a different base pair. The mut_rate should be between 0 and 1.

source

KmerGMA.mutate_seq! — Function

mutate_seq!(seq::LongSequence{DNAAlphabet{4}}, mut_rate::Real)

randomly subsitutes mute_rate portion of the input biosequence seq in place.

source