Alignment Free Sequence Comparison
Alignment free method for large scale genomic sequence analysis
Approximate jaccard index calculation
- minhash
    
- A instance for locality sensitive hashing that approximately preserve jaccard distance. (There are also some LSH that preserve hamming distance, etc)
 - k-minimum values (KMVs) sketching is a widely used variant of minhash
        
- We have two genome
 - We have a hash function
 - For each genome, we calculate the hash value of every q-gram, take k smallest hash values
 - The overlap bewteen two set of hash value approximate jaccard distance beween all k-mers in two genomes
 
 - Useful tools
        
- mash
 - sourmash
 - The 
sketch.shutility in BBMap, also see this post https://www.biostars.org/p/234837/ 
 
 - Some applications
    
- 2015, NBT, Assembling large genomes with single-moleculesequencing and locality-sensitive hashing Use min-hash to define anchor between noisy long reads for sequence assembly
 - 2020,Genome Biology,Metalign: efficient alignment-based metagenomic profiling via containment min hash For metagenomic profiling, use min-hash to reduce the size of database to perform sequence alignment with minimap2
 
 
k-mer sampling
- Counting / indexing all k-mers in a genome requires large memory
 - 
    
We may sample some representative k-mers
 - minimizer / winnowing
    
- sample 1 k-mer in w consecutive k-mers
 - For two substring of length w-k+1, if their sequence is identical, the same k-mer should be sampled
 - So in addition to specifying w and k, priority of all possible k-mer should be specified, and k-mer with hihgest priority is took
 - This strategy for k-mer sampling is originially proposed in 2004, Bioinformatics, Reducing storage requirements for biological sequence comparison k-mer sampled by this strategy is called minimizer.
 - k-mer sampling has very wide applications in sequence comparisons
        
- minimap, kraken, …
 
 - further reading
        
- https://www.cs.cmu.edu/~gmarcais/slides/PAG2019.pdf
 - https://homolog.us/blogs/bioinfo/2017/10/25/intro-minimizer/
 - https://homolog.us/blogs/bioinfo/2014/04/06/2014-the-year-of-minimizers-in-bioinformatics/
 - https://genomeinformatics.github.io/mashmap/
 - https://academic.oup.com/bioinformatics/article/37/17/2563/6162158?login=true
 - 2017, Plos Computational Biology, Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing
 
 
 - Select m k-mers with smallest hash values
    
- Same as using m k-mers that corresponds to minhash sketches
 - See 2018, Nature Communication, Clustering huge protein sequence sets in linear time
 - extract kmers such that the same k-mers tend to be extracted from homologous sequences.
 - avoid positional clustering of selected k-mers in order to be sensitive to detect local homologies in every region of a sequence
 - we would like to extract k-mers that tend to be conserved between homologous sequences
 - A MinHash sketch of size 1 is equivalent to minimizer, see the mash paper https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x
 
 - 不同长度序列的minhash signature的大小是固定的,而minimizer的数量和序列的长度是正相关的
 
LSH in other fields
- In addition to estimate Jaccard distance with minhash, we could use other hash function or other distance estimation
 - See http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf
 
Approximate member query and efficient counting
- Bloom filter
    
- Query whether an element exists in a large set. May generate false positive, but never false negative.
 - An alternative to the memory intensive hash table
 - A m bits array, K hash functions
 - For a new instance, each hash function map input to one of the positions in 1..m
 
 - Some variant
    
- counting bloom filter
 - spectral bloom filter
 
 - Quotient filter
 
Reading List
- 2019, Annual Review of Biomedical Data Science, Sketching and Sublinear Data Structures in Genomics
 - 2019, Genome Biology, When the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data