Searching and Clustering
Distance / Similarity Measure
Indexing and hashing for exact / approximate searching
Data Structure for Exact Search
- kd-tree: a binary tree for efficient nearest neighbor search in Euclidean space (see the sketch below)
- hierarchical k-means tree
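
A minimal sketch of exact nearest neighbor search with a kd-tree, using scipy's `cKDTree`; the 2-D toy data and query point are made up for illustration:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 2))   # toy 2-D data

tree = cKDTree(points)                # build the kd-tree once
query = np.array([0.0, 0.0])
dist, idx = tree.query(query, k=5)    # exact 5 nearest neighbors
print(idx, dist)
```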
Locality sensitive hashing
- A good tutorial http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf
- Also see 4 Pictures that Explain LSH - Locality Sensitive Hashing Tutorial
- http://www.stat.ucdavis.edu/~chohsieh/teaching/ECS289G_Fall2016/lecture10.pdf
- https://chanzuckerberg.github.io/ExpressionMatrix2/doc/LshSlides-Nov2017.pdf
Approximation for Hamming Distance
- Bit sampling LSH: randomly choose a subset of bit positions and compare only those (see the sketch below)
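
A minimal sketch of bit sampling (helper names like `sample_bits` are mine): two binary vectors agree on a uniformly random bit with probability \(1 - \text{Hamming}(x,y)/d\), so the fraction of mismatches among the sampled bits estimates the normalized Hamming distance.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 1024, 64
bits = rng.choice(d, size=k, replace=False)      # fixed random subset of bit positions

def sample_bits(x):
    """Bit-sampling hash: keep only the k sampled positions."""
    return x[bits]

x = rng.integers(0, 2, size=d)
y = x.copy()
y[rng.choice(d, size=100, replace=False)] ^= 1   # flip 100 bits

est = np.mean(sample_bits(x) != sample_bits(y))  # estimated normalized Hamming distance
true = np.mean(x != y)
print(est, true)                                 # both close to 100/1024
```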
Approximation for Jaccard Distance
- minhash can be utilized to approximate the Jaccard distance between the k-mer sets of two large genomes, see https://uaauaguga.github.io/jekyll/update/2021/05/15/Alignment-free-method-for-sequence-comparison.html. The same technique applies to estimating the similarity between two documents, or any large sparse binary vectors (a sketch follows this list)
- If we instead randomly sample k rows of the binary table, what we approximate is the Hamming distance, not the Jaccard distance
- “min” here means the element of the set with the smallest hash value
- If we encode k-mer sets as a binary table, a minhash function h maps each set to a row, which amounts to an implicit reordering of the table rows
- The direct output of minhash is called the signature matrix: each column is a sample, and each row corresponds to a different hash function
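
A minimal minhash sketch (pure Python; the linear hash family and all helper names are my own illustrative choices): two sets share the same minimum hash value with probability equal to their Jaccard similarity, so averaging agreement over many hash functions estimates it.

```python
import random

random.seed(0)
PRIME = (1 << 61) - 1      # large prime for the hash family
N_HASHES = 128
# each h(x) = (a*x + b) mod PRIME acts as an implicit row reordering
params = [(random.randrange(1, PRIME), random.randrange(PRIME))
          for _ in range(N_HASHES)]

def kmer_set(seq, k=4):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(kmers):
    """One column of the signature matrix: the minimum hash value per function."""
    keys = [hash(km) & 0xFFFFFFFF for km in kmers]
    return [min((a * x + b) % PRIME for x in keys) for a, b in params]

def estimate_jaccard(sig1, sig2):
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)

s1, s2 = kmer_set("ACGTACGTACGTAAAC"), kmer_set("ACGTACGTACGTAAAG")
est = estimate_jaccard(minhash_signature(s1), minhash_signature(s2))
true = len(s1 & s2) / len(s1 | s2)
print(est, true)
```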
Approximation for Cosine Distance
- “Signed Random Projections”: each bit of the signature is the sign of the dot product with a random hyperplane; two vectors disagree on a bit with probability \(\theta/\pi\), where \(\theta\) is the angle between them (see the sketch below)
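
A minimal sketch of signed random projections (function names are mine): the fraction of differing signature bits estimates the angle, and hence the cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 100, 256
planes = rng.normal(size=(n_bits, d))      # random Gaussian hyperplane normals

def srp_signature(x):
    """Sign of each random projection, as a 0/1 bit vector."""
    return (planes @ x > 0).astype(np.uint8)

x = rng.normal(size=d)
y = x + 0.5 * rng.normal(size=d)           # a noisy copy of x

hamming_frac = np.mean(srp_signature(x) != srp_signature(y))
est_angle = hamming_frac * np.pi           # Pr[bit differs] = theta / pi
true_angle = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(est_angle, true_angle)
```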
Reading
- 2014, Hashing for Similarity Search: A Survey
- 2017, Exact clustering in linear time
- 2021, Genome Research, SHARP: hyperfast and accurate processing of single-cell RNA-seq data via ensemble random projection
Clustering methods
K-means
- Overview
- Randomly choose \(K\) points as initial centroids
- Assign each data point to the nearest centroid
- Update each centroid to the mean of its assigned points, and iterate until assignments stabilize
- Time complexity: \(O(nK)\) per iteration, where \(n\) is the number of sample points and \(K\) is the number of predefined clusters
- Space complexity: \(O(n+K)\)
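
A minimal from-scratch sketch of the iterations listed above (numpy only; the toy data and function name are mine):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]      # random init
    for _ in range(n_iter):
        # assignment step: O(nK) distance computations per iteration
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of its points
        # (empty clusters are not handled in this toy sketch)
        new = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, K=3)
```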
Hierarchical clustering
- Agglomerative: start from \(n\) clusters, each containing a single sample point, and iteratively merge them
- For each round
- Input is some clusters with pairwise similarity / affinity matrix
- Merge two closest clusters and update affinity matrix
- single linkage
- the proximity of two clusters is the maximum of the proximity between any two points in the two clusters (equivalently, the minimum pairwise distance)
- complete linkage
- the proximity of two clusters is the minimum of the proximity between any two points in the two clusters (equivalently, the maximum pairwise distance)
- average linkage, also known as UPGMA (Unweighted Pair Group Method with Arithmetic mean)
- the proximity of two clusters is the average of pairwise proximities between samples in the two clusters
- WPGMA
- weighted version of UPGMA
- centroid
- the proximity of two clusters is the proximity between the centroids of the two clusters
- ward
- the proximity is the increase in the squared error that results when two clusters are merged
- hierarchical agglomerative clustering is generally computationally expensive
- for general cases, \(O(n^3)\) in time, \(O(n^2)\) in space
- Divisive: start from one cluster that contains all sample points, and recursively split it
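
A minimal agglomerative sketch with scipy; the `method` values map onto the linkage definitions above (toy data is mine):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

# 'single', 'complete', 'average' (UPGMA), 'weighted' (WPGMA),
# 'centroid', and 'ward' correspond to the proximity definitions above
Z = linkage(X, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)
```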
DBSCAN
- Automatically determines the number of clusters
- Deterministic (except that border points may be assigned to different clusters depending on visit order)
- Time complexity: \(O(n)\) × the time to query all points within a given radius
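
A minimal usage sketch with scikit-learn's DBSCAN (the toy data and parameter values are arbitrary):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(4, 0.3, (50, 2)),
               rng.uniform(-2, 6, (10, 2))])      # two blobs plus noise

# eps is the radius of the neighborhood query; min_samples defines core points
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))                                # cluster ids; -1 marks noise
```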
Types of similarity graph
- Dense similarity graph
- \(\epsilon\)-neighborhood graph
- k-nearest neighbor graph
- mutual k-nearest neighbor graph
- Shared Nearest Neighbor (SNN) similarity graph
- See https://people.csail.mit.edu/dsontag/courses/ml14/notes/Luxburg07_tutorial_spectral_clustering.pdf
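
A minimal sketch constructing three of these graphs with scikit-learn (radius and k values are arbitrary):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

eps_graph = radius_neighbors_graph(X, radius=0.5)   # epsilon-neighborhood graph
knn_graph = kneighbors_graph(X, n_neighbors=5)      # (directed) kNN graph
# mutual kNN graph: keep an edge only if it appears in both directions
mutual_knn = knn_graph.minimum(knn_graph.T)
```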
Graph-based clustering
- The relationship between clustering and community detection
- Graph-based clustering, for instance, maps the task of finding clusters to the task of partitioning a proximity graph into connected components (see the sketch after this list)
- https://web.iitd.ac.in/~bspanda/graphclustering.pdf
- Bron–Kerbosch algorithm (maximal clique enumeration)
- Leiden algorithm
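
A minimal sketch of the idea in the second bullet: build a kNN proximity graph and take its connected components as clusters (scipy/scikit-learn; the data and k value are arbitrary):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

# symmetric kNN proximity graph; clusters = connected components
G = kneighbors_graph(X, n_neighbors=5)
G = G.maximum(G.T)                                  # symmetrize
n_clusters, labels = connected_components(G, directed=False)
print(n_clusters)
```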
Performance evaluation
Clustering in single cell data analysis
- 2019, Nature Reviews Genetics, Challenges in unsupervised clustering of single-cell RNA-seq data
The relationship between clustering and searching
- Finding exact nearest neighbors can require computing the pairwise distances between all points. Clusters and their cluster prototypes can often be found much more efficiently, and a query can then be restricted to the points in the nearest cluster (see the sketch below)
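
A minimal sketch of this idea, the inverted-file style of cluster-pruned search (assuming a recent scipy for `kmeans2`'s `seed`/`minit` arguments; the helper name is mine): assign points to k-means prototypes offline, then only scan the cluster whose prototype is closest to the query.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 16))
centroids, labels = kmeans2(X, 50, minit="++", seed=0)   # cluster prototypes

def approx_nearest(q):
    """Scan only the cluster whose prototype is closest to the query."""
    c = np.linalg.norm(centroids - q, axis=1).argmin()
    members = np.where(labels == c)[0]
    best = members[np.linalg.norm(X[members] - q, axis=1).argmin()]
    return best

q = rng.normal(size=16)
print(approx_nearest(q))
```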