Get Annotations
Play with Genome annotation
Where to download genome annotation in gtf/gff
Annotation resources in bioconductor
- Some material for reading
- https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf
- https://www.bioconductor.org/help/course-materials/2019/BSS2019/05_Annotations.html
- Also see discussion here https://www.biostars.org/p/287871/ for difference between biomart and bioconductor annotation resources
Biomart
- The biomaRt package is simply a way to programmatically access the Biomart server and get the results back into R
A tidy way for id conversion
library(org.Mm.eg.db)
# convert ensembl id to entrez id
entrez.ids <- mapIds(org.Mm.eg.db, keys=gene.ids, column="ENTREZID", keytype="ENSEMBL", multiVals="first")
Mapping between homolog
Functional annotation
- So call functional annotation is actually mappings from genes to functions
- If we have a gene set, to see whether they are related to a function, we use genes that annotated to such function, and use the overlap between known gene set and given gene set as a proxy for functional relevance
- If we have a ranked list, we can do similar things
Get KEGG annotations with REST API
- Read https://www.kegg.jp/kegg/rest/keggapi.html for the native version of KEGG’s REST API
-
Several packages provides wrappers for these API
- In Biopython
from Bio.KEGG import REST
from tqdm import tqdm
import re
# Get KEGG pathway id and description of human
pathWayIds = REST.kegg_list("pathway","hsa").read().strip().split("\n")
# Retrieve data of a KEGG pathway
pathWayId = pathWayIds[0].split("\t")[0].split(":")[1]
# hsa00010
pathWayData = REST.kegg_get(pathWayId).read()
# A structured string,quite complicate
# Here we get entrez id from this string as an example
def getEntrezId(result):
entrezs = []
flag = 0
for line in result.split("\n"):
data = re.split("\s+",line)
if len(data)<2:
continue
if data[0]=="GENE":
flag = 1
entrezs.append(data[1])
elif flag == 1:
if data[0]=="":
entrezs.append(data[1])
else:
break
return entrezs
entrezIds = getEntrezId(pathWayData)
- In bioconductor package KEGGREST
library(KEGGREST)
homo.pathways <- keggList("pathway","hsa")
pathway.id <- names(homo.pathways)[1]
pathway.data <- keggGet(c(pathway.id))
# keggGet parsed the returned string to a list
entrez.ids <- pathway.data[[1]]$GENE[c(T,F)]
# Here the [c(T,F)] trick take element with odds index
Get GO annotation in R
- See discussion here https://www.biostars.org/p/52101/#68158
library(GO.db)
library(AnnotationDbi)
library(org.Hs.eg.db)
goterms <- AnnotationDbi::Ontology(GO.db::GOTERM)
go.ids <- c(names(goterms)[1:2])
go.gene.list <- AnnotationDbi::mapIds(org.Hs.eg.db, keys=go.ids, column="ENTREZID",
keytype="GOALL", multiVals='list')
- For enrichment of non-model organism, see Annotationhub
library(AnnotationDbi)
library(AnnotationHub)
hub <- AnnotationHub()
# Query species you want
dm <- query(ah, c("cgriseus"))
# Retrieve database from dm object with its id
org.cgriseus.db <- hub[["AH70586"]]
# THen simply replace org.Hs.eg.db with your database for retrieve go term
- This this post for further reading
Softwares and tools for enrichment analysis
- Under most situation, its not necessary to download gene set and perform hypergeometric test your self for enrichment analysis
- Here list two useful tools
- clusterProfiler
- More bioinformatician oriented
- metascape
- Web tool, more bench scientist oriented
Get annotation from ucsc mysql server
# show available tables
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg38 -e "SHOW tables"
# show structure of a table
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg38 -e "DESC rmsk"
# query a database
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg38 -e "SELECT * FROM geneid" > geneid.txt
Play with SQLite
- Many bioinformatic annotations can be directly accessed from bioconductor packages in SQLite format