Get Annotations

Play with Genome annotation

Where to download genome annotation in gtf/gff

Gencode
- mice and human
Ensembl

Annotation resources in bioconductor

Some material for reading
https://bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf
https://www.bioconductor.org/help/course-materials/2019/BSS2019/05_Annotations.html
Also see discussion here https://www.biostars.org/p/287871/ for difference between biomart and bioconductor annotation resources

Biomart

The biomaRt package is simply a way to programmatically access the Biomart server and get the results back into R

A tidy way for id conversion

library(org.Mm.eg.db)
# convert ensembl id to entrez id
entrez.ids <- mapIds(org.Mm.eg.db, keys=gene.ids, column="ENTREZID", keytype="ENSEMBL", multiVals="first")

Mapping between homolog

https://www.r-bloggers.com/2016/10/converting-mouse-to-human-gene-names-with-biomart-package/

Functional annotation

So call functional annotation is actually mappings from genes to functions
If we have a gene set, to see whether they are related to a function, we use genes that annotated to such function, and use the overlap between known gene set and given gene set as a proxy for functional relevance
If we have a ranked list, we can do similar things

Get KEGG annotations with REST API

Read https://www.kegg.jp/kegg/rest/keggapi.html for the native version of KEGG’s REST API
Several packages provides wrappers for these API
In Biopython

from Bio.KEGG import REST
from tqdm import tqdm
import re

# Get KEGG pathway id and description of human
pathWayIds = REST.kegg_list("pathway","hsa").read().strip().split("\n")

# Retrieve data of a KEGG pathway
pathWayId = pathWayIds[0].split("\t")[0].split(":")[1]
# hsa00010
pathWayData = REST.kegg_get(pathWayId).read()
# A structured string,quite complicate
# Here we get entrez id from this string as an example
def getEntrezId(result):  
    entrezs = []
    flag = 0
    for line in result.split("\n"):
        data = re.split("\s+",line)
        if len(data)<2:
            continue
        if data[0]=="GENE":
            flag = 1
            entrezs.append(data[1])
        elif flag == 1:
            if data[0]=="":
                entrezs.append(data[1])
            else:
                break
    return entrezs

entrezIds = getEntrezId(pathWayData)

In bioconductor package KEGGREST

library(KEGGREST)
homo.pathways <- keggList("pathway","hsa")
pathway.id <- names(homo.pathways)[1]
pathway.data <- keggGet(c(pathway.id))
# keggGet parsed the returned string to a list
entrez.ids <- pathway.data[[1]]$GENE[c(T,F)]
# Here the [c(T,F)] trick take element with odds index

Get GO annotation in R

See discussion here https://www.biostars.org/p/52101/#68158

library(GO.db)
library(AnnotationDbi)
library(org.Hs.eg.db)
goterms <- AnnotationDbi::Ontology(GO.db::GOTERM)
go.ids <- c(names(goterms)[1:2])
go.gene.list <- AnnotationDbi::mapIds(org.Hs.eg.db, keys=go.ids, column="ENTREZID",
                                  keytype="GOALL", multiVals='list')

For enrichment of non-model organism, see Annotationhub

library(AnnotationDbi)
library(AnnotationHub)
hub <- AnnotationHub()
# Query species you want 
dm <- query(ah, c("cgriseus"))
# Retrieve database from dm object with its id
org.cgriseus.db <- hub[["AH70586"]]
# THen simply replace org.Hs.eg.db with your database for retrieve go term

This this post for further reading

Softwares and tools for enrichment analysis

Under most situation, its not necessary to download gene set and perform hypergeometric test your self for enrichment analysis
Here list two useful tools
clusterProfiler
- More bioinformatician oriented
metascape
- Web tool, more bench scientist oriented

Get annotation from ucsc mysql server

# show available tables
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg38 -e "SHOW tables"
# show structure of a table
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg38 -e "DESC rmsk"
# query a database
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg38 -e "SELECT * FROM geneid" > geneid.txt

Play with SQLite

Many bioinformatic annotations can be directly accessed from bioconductor packages in SQLite format