Alignment-Free Sequence Comparison

Alignment-free methods for large-scale genomic sequence analysis

MinHash
- An instance of locality-sensitive hashing (LSH) that approximately preserves Jaccard similarity, and hence Jaccard distance = 1 - similarity. (Other LSH families preserve Hamming distance, etc.)
- k-minimum values (KMV) sketching is a widely used variant of MinHash
  - We have two genomes
  - We have a hash function
  - For each genome, we compute the hash value of every q-gram (a substring of length q, often called a k-mer) and keep the k smallest hash values
  - The overlap between the two sets of k hash values approximates the Jaccard similarity between the full q-gram sets of the two genomes
- Useful tools
  - mash
  - sourmash
  - The sketch.sh utility in BBMap, also see this post https://www.biostars.org/p/234837/
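
The KMV steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not how mash or sourmash are implemented: the function names are hypothetical, BLAKE2b is an arbitrary choice of hash function, and the estimator shown (take the k smallest hashes of the merged sketches and count those present in both) is one common way to turn the overlap into a Jaccard similarity estimate. Here `q` is the q-gram length and `k` is the sketch size.

```python
import hashlib


def qgrams(seq, q):
    """Return the set of all length-q substrings (q-grams) of seq."""
    return {seq[i:i + q] for i in range(len(seq) - q + 1)}


def hash64(qgram):
    """Deterministic 64-bit hash of a q-gram (BLAKE2b is an assumption)."""
    digest = hashlib.blake2b(qgram.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def kmv_sketch(seq, q, k):
    """Keep the k smallest distinct hash values over all q-grams of seq."""
    return sorted({hash64(g) for g in qgrams(seq, q)})[:k]


def estimate_jaccard(sketch_a, sketch_b, k):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    a, b = set(sketch_a), set(sketch_b)
    merged = sorted(a | b)[:k]  # bottom-k sketch of the union
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in a and h in b)
    return shared / len(merged)


def exact_jaccard(seq_a, seq_b, q):
    """Exact Jaccard similarity of the two q-gram sets, for comparison."""
    ga, gb = qgrams(seq_a, q), qgrams(seq_b, q)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


if __name__ == "__main__":
    a = "ACGTACGTGGCATTGACCGTA"
    b = "ACGTACGTTTCATTGACCGTA"
    q, k = 4, 8
    est = estimate_jaccard(kmv_sketch(a, q, k), kmv_sketch(b, q, k), k)
    print("estimated:", est, "exact:", exact_jaccard(a, b, q))
```

With a small `k` the estimate is noisy; once `k` reaches the number of distinct q-grams in the union, the sketch holds every hash and the estimate matches the exact value (barring hash collisions).
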
Some applications
- 2015, NBT, "Assembling large genomes with single-molecule sequencing and locality-sensitive hashing": uses MinHash to define anchors between noisy long reads for sequence assembly
- 2020, Genome Biology, "Metalign: efficient alignment-based metagenomic profiling via containment min hash": for metagenomic profiling, uses containment MinHash to shrink the reference database before sequence alignment with minimap2

Besides estimating Jaccard similarity (or distance) with MinHash, we could use other hash families to estimate other distances
See http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf

Bloom filter
- Tests whether an element is in a large set. It may report false positives, but never false negatives.
- A memory-efficient alternative to a hash table
- An m-bit array and K hash functions
- To insert an element, each hash function maps it to one of the positions 1..m, and those bits are set to 1. To query, check whether all K positions are set.
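
A minimal sketch of the structure described above, assuming string elements. The class name and parameters are hypothetical, the K hash functions are derived by salting BLAKE2b with the function index (an illustrative choice, not a requirement), and positions are zero-based (0..m-1) rather than 1..m.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings: an m-bit array and K hash functions."""

    def __init__(self, m, num_hashes):
        self.m = m
        self.num_hashes = num_hashes
        self.bits = bytearray(m)  # one byte per bit, for readability only

    def _positions(self, item):
        # Derive K hash functions by salting BLAKE2b with the function index.
        for i in range(self.num_hashes):
            salt = i.to_bytes(8, "big")
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=salt
            ).digest()
            yield int.from_bytes(digest, "big") % self.m  # position in 0..m-1

    def add(self, item):
        """Set the K bits selected by the hash functions."""
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        """True if all K bits are set: 'possibly present' (may be a false positive)."""
        return all(self.bits[pos] for pos in self._positions(item))


if __name__ == "__main__":
    bf = BloomFilter(m=1024, num_hashes=3)
    for kmer in ["ACGT", "CGTA", "GTAC"]:
        bf.add(kmer)
    print("ACGT" in bf)  # an inserted element is always reported as present
    print("TTTT" in bf)  # usually False, but a false positive is possible
```

Because bits are only ever set and never cleared, an inserted element can never be reported as absent; the false-positive rate grows as more of the m bits fill up.
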
Some variants
- counting Bloom filter
- spectral Bloom filter
Quotient filter