Searching and Clustering
Distance / Similarity Measure
Indexing and hashing for exact / approximate searching
Data Structure for Exact Search
- kd-tree: a binary tree for efficient nearest neighbor search in Euclidean space (see the sketch below)
- hierarchical k-means tree
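
A minimal sketch of exact nearest neighbor search with a kd-tree, using scipy's `cKDTree`; the 2-D toy data and query point are made up for illustration:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 2))   # toy 2-D data

tree = cKDTree(points)                # build the kd-tree once
query = np.array([0.0, 0.0])
dist, idx = tree.query(query, k=5)    # exact 5 nearest neighbors
print(idx, dist)
```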
Locality sensitive hashing
- A good tutorial http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf
- Also see 4 Pictures that Explain LSH - Locality Sensitive Hashing Tutorial
- http://www.stat.ucdavis.edu/~chohsieh/teaching/ECS289G_Fall2016/lecture10.pdf
- https://chanzuckerberg.github.io/ExpressionMatrix2/doc/LshSlides-Nov2017.pdf
Approximation for Hamming Distance
- Bit sampling LSH: randomly choose a subset of bit positions and compare only those (see the sketch below)
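
A minimal sketch of bit sampling (helper names like `sample_bits` are mine): two binary vectors agree on a uniformly random bit with probability \(1 - \text{Hamming}(x,y)/d\), so the fraction of mismatches among the sampled bits estimates the normalized Hamming distance.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 1024, 64
bits = rng.choice(d, size=k, replace=False)      # fixed random subset of bit positions

def sample_bits(x):
    """Bit-sampling hash: keep only the k sampled positions."""
    return x[bits]

x = rng.integers(0, 2, size=d)
y = x.copy()
y[rng.choice(d, size=100, replace=False)] ^= 1   # flip 100 bits

est = np.mean(sample_bits(x) != sample_bits(y))  # estimated normalized Hamming distance
true = np.mean(x != y)
print(est, true)                                 # both close to 100/1024
```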
Approximation for Jaccard Distance
- minhash can be utilized to approximate the Jaccard distance between the k-mer sets of two large genomes, see https://uaauaguga.github.io/jekyll/update/2021/05/15/Alignment-free-method-for-sequence-comparison.html. The same technique applies to estimating the similarity between two documents, or any large sparse binary vectors (a sketch follows this list)
- If we instead randomly sample k rows of the binary table, what we approximate is the Hamming distance, not the Jaccard distance
- “min” here means the element of the set with the smallest hash value
- If we encode k-mer sets as a binary table, a minhash function h maps each set to a row, which amounts to an implicit reordering of the table rows
- The direct output of minhash is called the signature matrix: each column is a sample, and each row corresponds to a different hash function
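
A minimal minhash sketch (pure Python; the linear hash family and all helper names are my own illustrative choices): two sets share the same minimum hash value with probability equal to their Jaccard similarity, so averaging agreement over many hash functions estimates it.

```python
import random

random.seed(0)
PRIME = (1 << 61) - 1      # large prime for the hash family
N_HASHES = 128
# each h(x) = (a*x + b) mod PRIME acts as an implicit row reordering
params = [(random.randrange(1, PRIME), random.randrange(PRIME))
          for _ in range(N_HASHES)]

def kmer_set(seq, k=4):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(kmers):
    """One column of the signature matrix: the minimum hash value per function."""
    keys = [hash(km) & 0xFFFFFFFF for km in kmers]
    return [min((a * x + b) % PRIME for x in keys) for a, b in params]

def estimate_jaccard(sig1, sig2):
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)

s1, s2 = kmer_set("ACGTACGTACGTAAAC"), kmer_set("ACGTACGTACGTAAAG")
est = estimate_jaccard(minhash_signature(s1), minhash_signature(s2))
true = len(s1 & s2) / len(s1 | s2)
print(est, true)
```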
Approximation for Cosine Distance
- “Signed Random Projections”: each bit of the signature is the sign of the dot product with a random hyperplane; two vectors disagree on a bit with probability \(\theta/\pi\), where \(\theta\) is the angle between them (see the sketch below)
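
A minimal sketch of signed random projections (function names are mine): the fraction of differing signature bits estimates the angle, and hence the cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 100, 256
planes = rng.normal(size=(n_bits, d))      # random Gaussian hyperplane normals

def srp_signature(x):
    """Sign of each random projection, as a 0/1 bit vector."""
    return (planes @ x > 0).astype(np.uint8)

x = rng.normal(size=d)
y = x + 0.5 * rng.normal(size=d)           # a noisy copy of x

hamming_frac = np.mean(srp_signature(x) != srp_signature(y))
est_angle = hamming_frac * np.pi           # Pr[bit differs] = theta / pi
true_angle = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(est_angle, true_angle)
```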
Reading
- 2014, Hashing for Similarity Search: A Survey
- 2017, Exact clustering in linear time
- 2021, Genome Research, SHARP: hyperfast and accurate processing of single-cell RNA-seq data via ensemble random projection
Clustering methods
K-means
- Overview
- Randomly choose \(K\) points as initial centroids
- Assign each data point to the nearest centroid
- Update each centroid to the mean of its assigned points, and iterate until assignments stabilize
- Time complexity: \(O(nK)\) per iteration, where \(n\) is the number of sample points and \(K\) is the number of predefined clusters
- Space complexity: \(O(n+K)\)
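
A minimal from-scratch sketch of the iterations listed above (numpy only; the toy data and function name are mine):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]      # random init
    for _ in range(n_iter):
        # assignment step: O(nK) distance computations per iteration
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of its points
        # (empty clusters are not handled in this toy sketch)
        new = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, K=3)
```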
Hierarchical clustering
- Agglomerative: start from \(n\) clusters, each containing a single sample point, and iteratively merge them
- For each round
- Input is some clusters with pairwise similarity / affinity matrix
- Merge two closest clusters and update affinity matrix
- single linkage
- the proximity of two clusters is the maximum of the proximity between any two points in the two clusters (equivalently, the minimum pairwise distance)
- complete linkage
- the proximity of two clusters is the minimum of the proximity between any two points in the two clusters (equivalently, the maximum pairwise distance)
- average linkage, also known as UPGMA (Unweighted Pair Group Method with Arithmetic mean)
- the proximity of two clusters is the average of pairwise proximities between samples in the two clusters
- WPGMA
- weighted version of UPGMA
- centroid
- the proximity of two clusters is the proximity between the centroids of the two clusters
- ward
- the proximity is the increase in the squared error that results when two clusters are merged
- hierarchical agglomerative clustering is generally computationally expensive
- for general cases, \(O(n^3)\) in time, \(O(n^2)\) in space
- Divisive: start from one cluster that contains all sample points, and recursively split it
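
A minimal agglomerative sketch with scipy; the `method` values map onto the linkage definitions above (toy data is mine):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

# 'single', 'complete', 'average' (UPGMA), 'weighted' (WPGMA),
# 'centroid', and 'ward' correspond to the proximity definitions above
Z = linkage(X, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)
```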
DBSCAN
- Automatically determines the number of clusters
- Deterministic (except that border points may be assigned to different clusters depending on visit order)
- Time complexity: \(O(n)\) × the time to query all points within a given radius
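
A minimal usage sketch with scikit-learn's DBSCAN (the toy data and parameter values are arbitrary):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(4, 0.3, (50, 2)),
               rng.uniform(-2, 6, (10, 2))])      # two blobs plus noise

# eps is the radius of the neighborhood query; min_samples defines core points
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))                                # cluster ids; -1 marks noise
```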
Types of similarity graph
- Dense similarity graph
- \(\epsilon\)-neighborhood graph
- k-nearest neighbor graph
- mutual k-nearest neighbor graph
- Shared Nearest Neighbor (SNN) similarity graph
- See https://people.csail.mit.edu/dsontag/courses/ml14/notes/Luxburg07_tutorial_spectral_clustering.pdf
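
A minimal sketch constructing three of these graphs with scikit-learn (radius and k values are arbitrary):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

eps_graph = radius_neighbors_graph(X, radius=0.5)   # epsilon-neighborhood graph
knn_graph = kneighbors_graph(X, n_neighbors=5)      # (directed) kNN graph
# mutual kNN graph: keep an edge only if it appears in both directions
mutual_knn = knn_graph.minimum(knn_graph.T)
```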
Graph-based clustering
- The relationship between clustering and community detection
- Graph-based clustering, for instance, maps the task of finding clusters to the task of partitioning a proximity graph into connected components (see the sketch after this list)
- https://web.iitd.ac.in/~bspanda/graphclustering.pdf
- Bron–Kerbosch algorithm (maximal clique enumeration)
- Leiden algorithm
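
A minimal sketch of the idea in the second bullet: build a kNN proximity graph and take its connected components as clusters (scipy/scikit-learn; the data and k value are arbitrary):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

# symmetric kNN proximity graph; clusters = connected components
G = kneighbors_graph(X, n_neighbors=5)
G = G.maximum(G.T)                                  # symmetrize
n_clusters, labels = connected_components(G, directed=False)
print(n_clusters)
```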
Performance evaluation
Clustering in single cell data analysis
- 2019, Nature Reviews Genetics, Challenges in unsupervised clustering of single-cell RNA-seq data
The relationship between clustering and searching
- Finding exact nearest neighbors can require computing the pairwise distances between all points. Clusters and their cluster prototypes can often be found much more efficiently, and a query can then be restricted to the points in the nearest cluster (see the sketch below)
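
A minimal sketch of this idea, the inverted-file style of cluster-pruned search (assuming a recent scipy for `kmeans2`'s `seed`/`minit` arguments; the helper name is mine): assign points to k-means prototypes offline, then only scan the cluster whose prototype is closest to the query.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 16))
centroids, labels = kmeans2(X, 50, minit="++", seed=0)   # cluster prototypes

def approx_nearest(q):
    """Scan only the cluster whose prototype is closest to the query."""
    c = np.linalg.norm(centroids - q, axis=1).argmin()
    members = np.where(labels == c)[0]
    best = members[np.linalg.norm(X[members] - q, axis=1).argmin()]
    return best

q = rng.normal(size=16)
print(approx_nearest(q))
```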