Notes for deep learning in biological sequence analysis
- A collection of studies that apply neural networks to learn representations of biological sequences
Related Resources
- 2017, Cell Systems, Enhancing Evolutionary Couplings with Deep Convolutional Neural Networks
- 2019, Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information
- 2019, Nature Methods, Unified rational protein engineering with sequence-based deep representation learning
- mLSTM trained to minimize the cross-entropy loss of next-amino-acid prediction; its fixed-size hidden states serve as the feature representation
- One hidden state activation per amino acid
- Average the activations across all amino acids in the full-length sequence to get a representation of the protein
- Another choice is to use the activation of the last hidden state
- doc2vec: https://github.com/fhalab/embeddings_reproduction
- https://github.com/churchlab/UniRep
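
The two pooling choices noted above (averaging the per-residue hidden states versus taking the last hidden state) can be sketched as follows. This is a minimal illustration in plain Python, not the UniRep code: the function names `mean_pool` and `last_state` are hypothetical, and the toy hidden states are made-up numbers rather than output from a trained mLSTM.

```python
from typing import List


def mean_pool(hidden_states: List[List[float]]) -> List[float]:
    """Average per-residue hidden states (length x dim) into one dim-sized vector."""
    if not hidden_states:
        raise ValueError("need at least one hidden state")
    length = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / length for d in range(dim)]


def last_state(hidden_states: List[List[float]]) -> List[float]:
    """Use the hidden state at the final residue as the protein representation."""
    if not hidden_states:
        raise ValueError("need at least one hidden state")
    return list(hidden_states[-1])


# Toy input: 3 residues, hidden size 2 (illustrative values only).
toy_states = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
avg_repr = mean_pool(toy_states)    # one vector, independent of sequence length
last_repr = last_state(toy_states)  # also fixed size, but depends only on the final step
```

Either choice yields a vector whose size depends only on the hidden dimension, so proteins of different lengths can be fed to the same downstream model.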
- 2018, NeurIPS, Neural Edit Operations for Biological Sequences
- Replace the argmax in sequence alignment with a softmax, so that the alignment loss becomes differentiable
- Related works:
- Differentiable DTW loss for time series: https://arxiv.org/pdf/1703.01541.pdf
- Sequence alignment kernel: 2004, Bioinformatics, Protein homology detection using string alignment kernels
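
To illustrate the idea of smoothing the maximum in alignment, here is a minimal sketch of a Needleman-Wunsch global alignment score in which the hard max is replaced by a log-sum-exp smooth maximum. This is an assumption-laden illustration, not the formulation from the paper: the function names, the temperature parameter `gamma`, and the scoring values (`match`, `mismatch`, `gap`) are all hypothetical choices.

```python
import math
from typing import Sequence


def smooth_max(values: Sequence[float], gamma: float) -> float:
    """Log-sum-exp relaxation of max; approaches max(values) as gamma -> 0."""
    top = max(values)  # subtract the max for numerical stability
    return top + gamma * math.log(sum(math.exp((v - top) / gamma) for v in values))


def soft_alignment_score(a: str, b: str, gamma: float = 1.0,
                         match: float = 1.0, mismatch: float = -1.0,
                         gap: float = -1.0) -> float:
    """Global alignment score with the max in the recursion replaced by smooth_max."""
    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # prefix of a aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # prefix of b aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = smooth_max(
                [score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                 score[i - 1][j] + gap,       # gap in b
                 score[i][j - 1] + gap],      # gap in a
                gamma,
            )
    return score[n][m]


soft = soft_alignment_score("ACGT", "AGT", gamma=1.0)
nearly_hard = soft_alignment_score("ACGT", "AGT", gamma=1e-3)
```

Because every step is a smooth function of the scoring parameters, the final score can in principle be differentiated with respect to them, which a hard max does not allow at ties. With a small `gamma` the result stays close to the ordinary alignment score (2 for this toy pair under the assumed scores).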
- 2021, ICLR, BERTology Meets Biology: Interpreting Attention in Protein Language Models
- 2021, Bioinformatics, DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome
- 2021, PNAS, Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
- 2021, Current Protocols, Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets