Distance / Similarity Measure

Indexing and hashing for exact / approximate searching

  • kd-tree: binary tree for efficient search in Euclidean space
  • hierarchical k-means tree

Locality sensitive hashing

Approximation for Hamming Distance

  • Bit sampling LSH, that is, randomly choose a subset of bit positions
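A minimal sketch of bit sampling (pure Python; the function names are illustrative, not from any library). Restricting a binary vector to a random subset of positions means two vectors at Hamming distance \(d\) out of \(n\) bits collide on each sampled bit with probability \(1 - d/n\):

```python
import random

def bit_sampling_hash(n_bits, n_sampled, seed=0):
    """LSH for Hamming distance: the hash is the vector restricted to a
    random subset of bit positions. Two vectors at Hamming distance d
    collide on each sampled bit with probability 1 - d / n_bits."""
    rng = random.Random(seed)
    positions = rng.sample(range(n_bits), n_sampled)
    def h(bits):
        return tuple(bits[i] for i in positions)
    return h, positions

h, positions = bit_sampling_hash(n_bits=8, n_sampled=3, seed=42)
a = [1, 0, 1, 1, 0, 0, 1, 0]
b = list(a)
b[next(i for i in range(8) if i not in positions)] ^= 1  # flip one unsampled bit
print(h(a) == h(b))  # True: the differing bit was not sampled
```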

Approximation for Jaccard Distance

  • minhash can be utilized to approximate the Jaccard distance between the k-mer sets of two large genomes, see https://uaauaguga.github.io/jekyll/update/2021/05/15/Alignment-free-method-for-sequence-comparison.html. The same technique can be applied to estimate the similarity between two documents, or between large sparse binary vectors.
  • If we instead randomly sample k rows of the table, we approximate the Hamming distance, not the Jaccard distance
  • “min” here means the element of the set with the smallest hash value
  • If we encode the k-mer sets as a binary table, a minhash function h maps each set to a row index, or equivalently defines an implicit reordering of the table rows
  • The direct output of minhash is called the signature matrix: each column is one sample, and each row corresponds to one hash function
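A minimal minhash sketch (pure Python; salting the built-in `hash` is an illustrative stand-in for a proper family of independent hash functions). Each column of the signature matrix keeps, per hash function, the minimum hash value over the set, and the fraction of agreeing rows estimates the Jaccard similarity:

```python
import random

def minhash_signature(s, n_hashes=128, seed=0):
    """One column of the signature matrix: for each hash function,
    keep the minimum hash value over all elements of the set."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(n_hashes)]
    return [min(hash((salt, x)) for x in s) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of rows where the two signatures agree ~ Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# toy k-mer sets (k = 3) from two nearly identical sequences
A = set("ACGTACGT"[i:i + 3] for i in range(6))
B = set("ACGTACGA"[i:i + 3] for i in range(6))
print(estimate_jaccard(minhash_signature(A), minhash_signature(B)))
```

The true Jaccard similarity here is \(4/5\); the estimate fluctuates around it with standard error shrinking as the number of hash functions grows.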

Approximation for Cosine Distance

  • “Signed Random Projections”
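A sketch of signed random projections (names are illustrative): each random hyperplane contributes one sign bit, the fraction of disagreeing bits estimates \(\theta/\pi\), and from that the cosine can be recovered:

```python
import math
import random

def srp_signature(v, n_planes=64, seed=0):
    """Signed random projections: one sign bit per random Gaussian hyperplane."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in v] for _ in range(n_planes)]
    return [1 if sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0 else 0
            for p in planes]

def estimate_cosine(sig_a, sig_b):
    """Fraction of differing bits ~ theta / pi, so cosine ~ cos(pi * fraction)."""
    frac = sum(a != b for a, b in zip(sig_a, sig_b)) / len(sig_a)
    return math.cos(math.pi * frac)

u = [1.0, 0.0]
w = [1.0, 1.0]  # 45 degrees from u, true cosine ~ 0.707
print(estimate_cosine(srp_signature(u), srp_signature(w)))
```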

Reading

Clustering methods

K-means

  • Overview
    • Randomly choose \(K\) points as centroids
    • Assign each data point to the nearest centroid
    • Update the centroids
  • Time complexity: \(O(nK)\) per iteration, where \(n\) is the number of sample points and \(K\) is the number of predefined clusters
  • Space complexity: \(O(n+K)\)
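The three steps above can be sketched as Lloyd's algorithm (pure Python, illustrative; real code would use vectorized libraries and a convergence check):

```python
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm: each iteration costs O(n * K) distance computations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # 1. randomly choose K points as centroids
    for _ in range(n_iter):
        # 2. assign each data point to the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # 3. update each centroid to the mean of its cluster
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centroids, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, clusters = kmeans(pts, k=2)
```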

Hierarchical clustering

  • Agglomerative: start from \(n\) clusters, each containing a single sample point
    • For each round
      • Input is some clusters with pairwise similarity / affinity matrix
      • Merge two closest clusters and update affinity matrix
    • single linkage
      • the proximity of two clusters is the maximum of the proximities (i.e., the minimum of the distances) between any two points in the two clusters
    • complete linkage
      • the proximity of two clusters is the minimum of the proximities (i.e., the maximum of the distances) between any two points in the two clusters
    • average linkage, alias UPGMA (Unweighted Pair Group Method with Arithmetic mean)
      • the proximity of two clusters is the average of the pairwise proximities between samples in the two clusters
    • WPGMA
      • weighted version of UPGMA
    • centroid
      • the proximity of two clusters is proximity between the centroids of two clusters
    • ward
      • the proximity is the increase in the squared error that results when two clusters are merged
    • hierarchical agglomerative clustering is generally computationally expensive
      • for general cases, \(O(n^3)\) in time, \(O(n^2)\) in space
  • Divisive: start from 1 cluster which contains all sample points
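A naive agglomerative sketch with single linkage (illustrative; it shows where the \(O(n^3)\) time and \(O(n^2)\)-ish cost comes from, while real libraries such as scipy are far more efficient):

```python
def single_linkage(points, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two closest
    clusters, where cluster distance = minimum pairwise point distance."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = [[p] for p in points]  # start from n singleton clusters
    while len(clusters) > n_clusters:
        # O(n^2) scan for the closest pair of clusters, repeated O(n) times
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

pts = [(0.0,), (0.1,), (5.0,), (5.1,)]
merged = single_linkage(pts, 2)
print(merged)
```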

DBSCAN

  • Automatically determines the number of clusters
  • Deterministic (up to the assignment of border points)
  • Time complexity: \(O(n)\) × (time to query the points within a given radius of a point)
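A minimal DBSCAN sketch (illustrative names; here the range query is a naive \(O(n)\) scan, which is exactly the factor an index such as a kd-tree would speed up):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN with a naive O(n) range query per point."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)  # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise (may later be reclassified as border point)
            continue
        labels[i] = cluster  # core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # core point: keep expanding the cluster
                queue.extend(nbrs)
        cluster += 1
    return labels

pts = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 9.0)]
labels_out = dbscan(pts, eps=0.5, min_pts=2)
print(labels_out)  # two clusters plus one noise point
```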

Types of similarity graph

Graph base clustering

Performance evaluation

Clustering in single cell data analysis

The relationship between clustering and searching

  • Finding nearest neighbors can require computing the pairwise distance between all points. Often clusters and their cluster prototypes can be found much more efficiently.
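As an illustration of this idea (a toy sketch: grid-cell prototypes stand in for k-means prototypes, and all names are hypothetical), searching only the cluster whose prototype is nearest to the query replaces an \(O(n)\) scan with roughly \(O(K + n/K)\) distance computations, at the price of an approximate answer:

```python
import random

def nearest(query, candidates):
    """Brute-force nearest neighbor among the given candidates."""
    return min(candidates,
               key=lambda p: sum((a - b) ** 2 for a, b in zip(query, p)))

def cluster_search(query, clusters):
    """Search only the cluster whose prototype is closest to the query."""
    prototypes = list(clusters)           # O(K) scan over prototypes
    best = nearest(query, prototypes)
    return nearest(query, clusters[best])  # O(n / K) scan inside one cluster

random.seed(0)
points = [(random.random(), random.random()) for _ in range(1000)]
# toy "clustering": bucket points by a coarse grid cell used as the prototype
clusters = {}
for p in points:
    proto = (round(p[0], 1), round(p[1], 1))
    clusters.setdefault(proto, []).append(p)

q = (0.33, 0.77)
hit = cluster_search(q, clusters)
print(hit)  # an approximate nearest neighbor of q
```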