Homology searching

Here is the main function of the package. Based on a novel kmer-based algorithm to reduce runtime over a whole genome to be O(#bp) on average.

Note

The documentation is unfinished and may be somewhat unclear. For more details on how the algorithm functions, some pre-print of the homology searching approach will be uploaded soon.

The algorithm intakes a set of similar reference sequences - for example a collection of V-genes - and iterates over a genome assembly(ies) to find genes that are homologous to the references.

KmerGMA.findGenesFunction
KmerGMA.findGenes(;
    genome_path::String,
    ref_path::String,
    k::Int = 6,
    KmerDistThr::Union{Int64, Float64} = 0,
    buffer::Int64 = 50,
    do_align::Bool = true,
    gap_open_score::Int = -69,
    gap_extend_score::Int = -1,
    do_return_dists::Bool = false,
    do_return_hit_loci::Bool = false,
    do_return_align::Bool = false
    verbose::Bool = true)

The main API to conduct homology searching in a genome, using a kmer-based sequence similarity metric, against a a reference sequence set. For example, the set of all germline V genes of a mammal. Returns the approximate matches as FASTA record vector WITHIN A VECTOR of length 1 in the default configuration of the parameters. The descriptions of the record contain information about the match. The format of the description appears as so:

"Identifier | dist = a | MatchPos = b:c | GenomePos = e | Len = f"

Where Identifier is the contig ID of the genome where the current hit was found. dist is the kmer similarity, and MatchPos is the unitrange for the match in the contig. GenomePos is the cumulative nucleotides that have been iterated over until the current contig, and Len is simply the length of the hit length.

...

Arguments

  • genome_path: a string that should be the path of the genome to conduct homology searching on.
  • ref_path: the file location of a fasta file containing the reference sequence(s). The references should be very similar in length.

Optional Arguments

  • k::Int64 = 6: the kmer length to use for approximate matching. Generally it should probably remain between 5 - 10 and probably does not have a profound impact on how the matches are found. Check out the pre-print for more information
  • KmerDistThr::Int64 = 0: the Kmer Distance distance threshold for sequence matches to the query reference sequence set. Out of the context of the algorithm, lower values mean matches have to be more similar to the references. If left as 0, it is automatically computed. Once again the pre-print has information on this argument.
  • buffer::Int64 = 50: the amount of nucleotides left and right of a matched sequence that should be added to the returned fasta sequences, as KmerGMA is a heuristic. If do_align is set to true the buffer will be included for a more accurate alignment.
  • verbose::Bool = true Indicates whether to show info in the REPL about the the progress of the processing
  • do_align: Whether to align the hits+buffer region to the consensus sequence of the references. Highly recommended to keep as true
  • gap_open_score::Int = -69: The gap_open scoring paramater for the BioAlignments semi global pairwise aligner function. The number should ideally be kept low to heavily penalize gap extensions for most use-cases.
  • gap_extend_score::Int = -1: The gap_extend scoring parameter for the BioAlignments semi global pairwise aligner function. Can be kept to the default -1 as long as the gap_open score is low.
  • do_return_dists: boolean to indicate whether the kmer distances along every window along the genome should be returned in a vector. (intensive memory consumption when genomes are large)
  • do_return_hit_loci: if true, will return an additional vector of the position within the genomic sequences of each hit, corresponding to the index in the hit vector.
  • do_return_align: if true, will return an additional vector of alignment object of each hit to the consensus reference sequence.
  • KmerDist_threshold_buffer::Real = 8.0: a value to determine approximately how much kmer distance a hit should be lower than any random non-hit sequence. Keep in mind that kmerdistance approximates edit distance for mutations better than indels.

...

The last three arguments would add term to the output. The output vector would incorporate the respective vectors in the same order of priority if any of the parameters are true.

Note: Playing with the KmerDistThr argument could return more or less matches every time.

source

An alternative version of KmerGMA.findGenes which clusters the reference sequence set into similar subsets is also avaliable, though its running speed would be "the number of subsets" times slower. However, the following function is recommended for when speed is less important than accuracy.

KmerGMA.findGenes_cluster_modeFunction
KmerGMA.findGenes_cluster_mode(;
    genome_path::String,
    ref_path::String,
    cluster_cutoffs = [7,12,20,25],
    k::Int = 6, 
    KmerDistThrs = Float64[0], 
    buff::Int64 = 50,
    do_align::Bool = true,
    gap_open_score::Int = -200,
    gap_extend_score::Int = -1,
    do_return_align::Bool = false,
    do_return_dists::Bool = false,
    do_return_hit_loci::Bool = false,
    verbose::Bool = true,
    kmerDist_threshold_buffer::Real = 7
    )

A slower (O(mn)) alternative to KmerGMA.findGenes to conduct homology searching in a genome, against a a reference sequence set, using a kmer-based sequence similarity metric. For example, the set of all germline V genes of a mammal.

Returns the approximate matches as FASTA record vector WITHIN A VECTOR of length 1 in the default configuration of the parameters. The descriptions of the record contain information about the match. The format of the description appears as so:

"Identifier | dist = a | KFV = b | MatchPos = c:d | GenomePos = e | Len = f"

Where Identifier is the contig ID of the genome where the current hit was found. dist is the kmer similarity, KFV is the reference kmer frequency vector that a hit was matched against (more info below), MatchPos is the unitrange for the match in the contig. GenomePos is the cumulative nucleotides that have been iterated over until the current contig, and Len is simply the length of the hit sequence

...

Arguments

  • genome_path: a string that should be the path of the genome to conduct homology searching on.
  • ref_path: the file location of a fasta file containing the reference sequence(s). The references should be very similar in length.

Optional Arguments

  • cluster_cutoffs = [7,12,20,25]: Important determination of the cutoff points to construct subsets of similar reference sequences. The cutoff points refer to the kmer distance against the reference. Usually, the more cutoffs may be better but the default cutoffs have been determined to be relatively robust.
  • k::Int64 = 6: the kmer length to use for approximate matching. Generally it should probably remain between 5 - 10 and probably does not have a profound impact on how the matches are found. Check out the pre-print for more information
  • KmerDistThrs = [0]: the Kmer Distance distance thresholds for each sequence matche to the query reference sequence set. Out of the context of the algorithm, lower values mean matches have to be more similar to the references. If left as 0, it is automatically computed. Once again a pre-print has information on this argument.
  • buffer::Int64 = 100: the amount of nucleotides left and right of a matched sequence that should be added to the returned fasta sequences, as KmerGMA is a heuristic. If do_align is set to true the buffer will be included for a more accurate alignment.
  • do_align: Whether to align the hits+bufer region to the consensus sequence of the references. Highly recommended to keep as true
  • gap_open_score::Int = -200: The gap_open scoring paramater for the BioAlignments semi global pairwise aligner function. The number should ideally be kept low to heavily penalize gap extensions for most use-cases.
  • gap_extend_score::Int = -1: The gap_extend scoring parameter for the BioAlignments semi global pairwise aligner function. Can be kept to the default -1 as long as the gap_open score is low.
  • do_return_align: if true, will return an additional vector of alignment object of each hit to the consensus reference sequence.
  • do_return_dists: boolean to indicate whether the kmer distances along every window along the genome should be returned in a vector. (intensive memory consumption when genomes are large)
  • do_return_hit_loci: if true, will return an additional vector of the position within the genomic sequences of each hit, corresponding to the index in the hit vector.
  • verbose::Bool = true Indicates whether to show info in the REPL about the the progress of the processing
  • kmerDist_threshold_buffer::Real = 7: a value to determine approximately how much kmer distance a hit should be lower than any random non-hit sequence. Keep in mind that kmerdistance approximates edit distance for mutations better than indels. Additionally, the false positive rate would likely increase as the value goes too low. Unfortunately this value is currently only optimized for k = 6. something around 20 seem to work well for k = 5.

...

The last three arguments would add terms to the output. When verbose is true the exact outputs are even stated. The output vector would incorporate the respective vectors in the same order of priority if any of the parameters are true.

There are some more details to be added in future releases.

source

To write resulting vector of hits into a fasta file, use the following function

KmerGMA.write_resultsFunction
write_results(KmerGMA_result_vec::Vector{FASTX.FASTA.Record}, file_path::String, width::Int64 = 95)

Writes all FASTA Records (from the FASTX package) from a vector into a fasta file location file_path. width is an optional argument that indicates the maximum width of sequences written to the file per line.

source

There is also possible pre-processing of the set of reference sequences that can be used in kmerGMA.findGenes_cluster_mode to further improve accuracy (though usually at the cost of a bit more more speed) More on this will be expanded on.

Experimental Strobemer-based homology searching

Note

The following function also does homology searching with strobemers but is not user-friendly and lacking more documentation and testing.

KmerGMA.Strobemer_findGenesFunction
Strobemer_findGenes(;
    genome_path::String,
    ref_path::String,
    s::Int = 2,
    w_min::Int = 3,
    w_max::Int = 5,
    q::Int = 5,
    KmerDistThr::Union{Int64, Float64} = 30,
    buffer::Int64 = 50,
    do_align::Bool = true,
    do_return_dists::Bool = false,
    do_return_hit_loci::Bool = false,
    do_return_align::Bool = false,
    verbose::Bool = true)

An experimental homology searcher that uses Strobemers, specifically randstrobes with 2 sub-kmers. Very unoptimized and there is not distance threshold estimation yet. More documentation is to come.

The function uses the same arguments and returns the same outputs as KmerGMA.findGenes and KmerGMA.findGenes_cluster_mode but s, w_min, w_max, q are randstrobe parameters.

source