Get Annotations
Play with Genome annotation
Where to download genome annotation in gtf/gff
Annotation resources in bioconductor
- Some material for reading
 - https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf
 - https://www.bioconductor.org/help/course-materials/2019/BSS2019/05_Annotations.html
 - Also see discussion here https://www.biostars.org/p/287871/ for difference between biomart and bioconductor annotation resources
 
Biomart
- The biomaRt package is simply a way to programmatically access the Biomart server and get the results back into R
 
A tidy way for id conversion
library(org.Mm.eg.db)
# convert ensembl id to entrez id
entrez.ids <- mapIds(org.Mm.eg.db, keys=gene.ids, column="ENTREZID", keytype="ENSEMBL", multiVals="first")
Mapping between homolog
Functional annotation
- So call functional annotation is actually mappings from genes to functions
 - If we have a gene set, to see whether they are related to a function, we use genes that annotated to such function, and use the overlap between known gene set and given gene set as a proxy for functional relevance
 - If we have a ranked list, we can do similar things
 
Get KEGG annotations with REST API
- Read https://www.kegg.jp/kegg/rest/keggapi.html for the native version of KEGG’s REST API
 - 
    
Several packages provides wrappers for these API
 - In Biopython
 
from Bio.KEGG import REST
from tqdm import tqdm
import re
# Get KEGG pathway id and description of human
pathWayIds = REST.kegg_list("pathway","hsa").read().strip().split("\n")
# Retrieve data of a KEGG pathway
pathWayId = pathWayIds[0].split("\t")[0].split(":")[1]
# hsa00010
pathWayData = REST.kegg_get(pathWayId).read()
# A structured string,quite complicate
# Here we get entrez id from this string as an example
def getEntrezId(result):  
    entrezs = []
    flag = 0
    for line in result.split("\n"):
        data = re.split("\s+",line)
        if len(data)<2:
            continue
        if data[0]=="GENE":
            flag = 1
            entrezs.append(data[1])
        elif flag == 1:
            if data[0]=="":
                entrezs.append(data[1])
            else:
                break
    return entrezs
entrezIds = getEntrezId(pathWayData)
- In bioconductor package KEGGREST
 
library(KEGGREST)
homo.pathways <- keggList("pathway","hsa")
pathway.id <- names(homo.pathways)[1]
pathway.data <- keggGet(c(pathway.id))
# keggGet parsed the returned string to a list
entrez.ids <- pathway.data[[1]]$GENE[c(T,F)]
# Here the [c(T,F)] trick take element with odds index
Get GO annotation in R
- See discussion here https://www.biostars.org/p/52101/#68158
 
library(GO.db)
library(AnnotationDbi)
library(org.Hs.eg.db)
goterms <- AnnotationDbi::Ontology(GO.db::GOTERM)
go.ids <- c(names(goterms)[1:2])
go.gene.list <- AnnotationDbi::mapIds(org.Hs.eg.db, keys=go.ids, column="ENTREZID",
                                  keytype="GOALL", multiVals='list')
- For enrichment of non-model organism, see Annotationhub
 
library(AnnotationDbi)
library(AnnotationHub)
hub <- AnnotationHub()
# Query species you want 
dm <- query(ah, c("cgriseus"))
# Retrieve database from dm object with its id
org.cgriseus.db <- hub[["AH70586"]]
# THen simply replace org.Hs.eg.db with your database for retrieve go term
- This this post for further reading
 
Softwares and tools for enrichment analysis
- Under most situation, its not necessary to download gene set and perform hypergeometric test your self for enrichment analysis
 - Here list two useful tools
 - clusterProfiler
    
- More bioinformatician oriented
 
 - metascape
    
- Web tool, more bench scientist oriented
 
 
Get annotation from ucsc mysql server
# show available tables
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg38 -e "SHOW tables"
# show structure of a table
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg38 -e "DESC rmsk"
# query a database
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg38 -e "SELECT * FROM geneid" > geneid.txt
Play with SQLite
- Many bioinformatic annotations can be directly accessed from bioconductor packages in SQLite format