TMM normalization in python
- TMM method in edgeR package is a popular method for RNA seq count matrix normalization, here I implement a python version
Brief description
- consider n gene, m sample in a (n,m) matrix
- take upper 75% quantile of each sample in expression matrix, divide it by library size, called f75 (m,)
- take sample nearest to mean as reference, you can set any sample as reference
- calculate TMM by compare each sample to the reference sample
- define: gene-wise log-fold-changes, also called log ratio of expression
- define: absolute expression level
- define: estimated asymptotic variance
-
Keep expression in quantile [0.3,0.7] and relative expression in quantile [0.3,0.7]
-
calculate average of \(M_g\), weighted by \(1/V_g\)
- The TMM factor is \(2^f\)
A python implementation
import pandas as pd
import numpy as np
# Get TMM factor of a sample, given the reference sample
def get_TMM_factor(count_observed,count_reference,logratioTrim=0.3,sumTrim=0.05,Acutoff=-1e10):
assert count_observed.shape[0] == count_reference.shape[0]
# Mask genes with zero count in at least one of two samples
mask = (count_observed>0)&(count_reference>0)
count_observed = count_observed[mask].copy()
count_reference = count_reference[mask].copy()
observed_size = count_observed.sum()
reference_size = count_reference.sum()
# Calculate log fold change
M = np.log2(count_observed/observed_size) - np.log2(count_reference/reference_size)
if(M.abs().max()) < 1e-6:
return 1
# Calculate absolute expression
A = (np.log2(count_observed/observed_size) + np.log2(count_reference/reference_size))/2
V = 1/count_observed - 1/observed_size + 1/count_reference -1/reference_size
mask = (M < np.inf) & (A<np.inf) & (A>-1e10)
M, A = M[mask], A[mask]
# Get genes that is believed to not be differentially expressed in two sample
# And genes which expression is not too high or too low
n = M.shape[0]
M_l, A_l = int(logratioTrim*n), int(sumTrim*n)
M_h, A_h = n - M_l,n - A_l
M_rank,A_rank = M.rank(),A.rank()
genes_selected = M_rank[(M_rank>=M_l) & (M_rank<=M_h) & (A_rank>=A_l) & (A_rank<=A_h)].index
M,V = M.loc[genes_selected],V.loc[genes_selected]
f = np.dot(1/V.values,M.values)/(1/V.values).sum()
return 2**f
# Get TMM scaling factor for a expression matrix
def get_scaling_factor(expression_matrix,reference):
scaling_factors = {}
for col in expression_matrix.columns:
f = get_TMM_factor(expression_matrix[col],expression_matrix[reference])
scaling_factors[col] = f
scaling_factors = pd.Series(scaling_factors)
scaling_factors = scaling_factors/np.exp(np.log(scaling_factors).mean())
return scaling_factors.loc[expression_matrix.columns]
# example usage
def example():
expression_matrix = pd.read_csv("TCGA-CRC-counts.txt",sep="\t",index_col=0)
expression_matrix = expression_matrix[expression_matrix.sum(axis=1)>0]
scaling_factor = get_scaling_factor(expression_matrix,"TCGA-A6-5662-11A")
Reference
- TMM’s publication: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-3-r25
- Source code in edgeR: https://rdrr.io/bioc/edgeR/src/R/calcNormFactors.R