Notes on deep learning in biological sequence analysis

Jun 25, 2021

A collection of studies that apply neural networks to learn representations of biological sequences

2017, Cell Systems, Enhancing Evolutionary Couplings with Deep Convolutional Neural Networks
2019, Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information
2019, Nature Methods, Unified rational protein engineering with sequence-based deep representation learning
- mLSTM trained to minimize the cross-entropy loss of next-amino-acid prediction; the fixed-size hidden states serve as the feature representation
- One hidden-state activation vector per amino acid
- Average the activations across all amino acids in the full-length sequence to get a representation of the protein
- Alternatively, use the activation of the last hidden state
- doc2vec: https://github.com/fhalab/embeddings_reproduction
- https://github.com/churchlab/UniRep
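The two pooling choices noted above (average over all positions vs. last hidden state) can be sketched as follows. This is a minimal illustration, not the UniRep code: `toy_hidden_states` is a hypothetical stand-in that uses a tiny tanh recurrent cell with random, untrained weights in place of the trained mLSTM, and the hidden size of 8 is an arbitrary assumption. The point is only that both pooling schemes turn a variable-length sequence into a fixed-size vector.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_hidden_states(sequence, hidden_size=8, seed=0):
    """Return one hidden-state vector per residue.

    Hypothetical stand-in for a trained mLSTM: a plain tanh recurrent
    cell with random (NOT learned) weights.
    """
    rng = random.Random(seed)
    # One random input vector per amino acid, plus a random recurrent matrix.
    embed = {aa: [rng.uniform(-1, 1) for _ in range(hidden_size)]
             for aa in AMINO_ACIDS}
    recur = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
             for _ in range(hidden_size)]
    h = [0.0] * hidden_size
    states = []
    for aa in sequence:
        x = embed[aa]
        h = [math.tanh(x[i] + sum(recur[i][j] * h[j]
                                  for j in range(hidden_size)))
             for i in range(hidden_size)]
        states.append(h)
    return states


def mean_pool(states):
    """Average each hidden unit across all sequence positions."""
    n = len(states)
    return [sum(column) / n for column in zip(*states)]


def last_state(states):
    """Use the hidden state after the final residue."""
    return list(states[-1])


states = toy_hidden_states("MKTAYIAKQR")
protein_vec_mean = mean_pool(states)   # fixed size, independent of length
protein_vec_last = last_state(states)  # fixed size, independent of length
```

Either vector can then be fed to a downstream model; averaging uses information from every position, while the last state relies on the recurrent cell to have summarized the whole sequence.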
2018, NeurIPS, Neural Edit Operations for Biological Sequences
- Replace the argmax in sequence alignment with a softmax, which makes the alignment loss differentiable
- Related works:
  - Differentiable DTW loss for time series: https://arxiv.org/pdf/1703.01541.pdf
  - Sequence alignment kernel: 2004, Bioinformatics, Protein homology detection using string alignment kernels
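The idea of softening the hard choice inside an alignment recursion can be sketched as follows. This is a minimal sketch under stated assumptions, not the formulation from the paper: it uses a global (Needleman-Wunsch style) recursion, a log-sum-exp smooth maximum with a temperature `gamma` (the same device used by the differentiable DTW loss linked above), and illustrative match/mismatch/gap scores chosen here rather than taken from any source.

```python
import math


def soft_max(values, gamma):
    """Smooth maximum: gamma * log(sum(exp(v / gamma))).

    Approaches max(values) as gamma -> 0 and is differentiable in
    every input, unlike the hard max.
    """
    m = max(values)  # subtract the max for numerical stability
    return m + gamma * math.log(sum(math.exp((v - m) / gamma) for v in values))


def alignment_score(a, b, gamma=None, match=1.0, mismatch=-1.0, gap=-1.0):
    """Global alignment score; gamma=None uses the ordinary hard max."""
    if gamma is None:
        pick = max
    else:
        def pick(vals):
            return soft_max(vals, gamma)

    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = pick([
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            ])
    return score[n][m]


hard = alignment_score("HEAGAWGHEE", "PAWHEAE")
soft = alignment_score("HEAGAWGHEE", "PAWHEAE", gamma=0.1)
```

Because the smooth maximum is never smaller than the hard maximum, `soft` upper-bounds `hard` and converges to it as `gamma` shrinks. In a learned setting the substitution and gap scores would be network outputs, and gradients would flow through every cell of the table.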
2021, ICLR, BERTology Meets Biology: Interpreting Attention in Protein Language Models
2021, Bioinformatics, DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome
- https://github.com/jerryji1993/DNABERT
2021, PNAS, Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
- https://github.com/facebookresearch/esm
2021, Current Protocols, Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets
- https://github.com/sacdallago/bio_embeddings