Play with Genome annotation

Where to download genome annotation in gtf/gff

Annotation resources in bioconductor

Biomart

  • The biomaRt package is simply a way to programmatically access the Biomart server and get the results back into R

A tidy way for id conversion

library(org.Mm.eg.db)
# convert ensembl id to entrez id
entrez.ids <- mapIds(org.Mm.eg.db, keys=gene.ids, column="ENTREZID", keytype="ENSEMBL", multiVals="first")

Mapping between homolog

Functional annotation

  • So call functional annotation is actually mappings from genes to functions
  • If we have a gene set, to see whether they are related to a function, we use genes that annotated to such function, and use the overlap between known gene set and given gene set as a proxy for functional relevance
  • If we have a ranked list, we can do similar things

Get KEGG annotations with REST API

from Bio.KEGG import REST
from tqdm import tqdm
import re

# Get KEGG pathway id and description of human
pathWayIds = REST.kegg_list("pathway","hsa").read().strip().split("\n")

# Retrieve data of a KEGG pathway
pathWayId = pathWayIds[0].split("\t")[0].split(":")[1]
# hsa00010
pathWayData = REST.kegg_get(pathWayId).read()
# A structured string,quite complicate
# Here we get entrez id from this string as an example
def getEntrezId(result):  
    entrezs = []
    flag = 0
    for line in result.split("\n"):
        data = re.split("\s+",line)
        if len(data)<2:
            continue
        if data[0]=="GENE":
            flag = 1
            entrezs.append(data[1])
        elif flag == 1:
            if data[0]=="":
                entrezs.append(data[1])
            else:
                break
    return entrezs

entrezIds = getEntrezId(pathWayData)
library(KEGGREST)
homo.pathways <- keggList("pathway","hsa")
pathway.id <- names(homo.pathways)[1]
pathway.data <- keggGet(c(pathway.id))
# keggGet parsed the returned string to a list
entrez.ids <- pathway.data[[1]]$GENE[c(T,F)]
# Here the [c(T,F)] trick take element with odds index

Get GO annotation in R

library(GO.db)
library(AnnotationDbi)
library(org.Hs.eg.db)
goterms <- AnnotationDbi::Ontology(GO.db::GOTERM)
go.ids <- c(names(goterms)[1:2])
go.gene.list <- AnnotationDbi::mapIds(org.Hs.eg.db, keys=go.ids, column="ENTREZID",
                                  keytype="GOALL", multiVals='list')
library(AnnotationDbi)
library(AnnotationHub)
hub <- AnnotationHub()
# Query species you want 
dm <- query(ah, c("cgriseus"))
# Retrieve database from dm object with its id
org.cgriseus.db <- hub[["AH70586"]]
# THen simply replace org.Hs.eg.db with your database for retrieve go term

Softwares and tools for enrichment analysis

  • Under most situation, its not necessary to download gene set and perform hypergeometric test your self for enrichment analysis
  • Here list two useful tools
  • clusterProfiler
    • More bioinformatician oriented
  • metascape
    • Web tool, more bench scientist oriented

Get annotation from ucsc mysql server

# show available tables
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg38 -e "SHOW tables"
# show structure of a table
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg38 -e "DESC rmsk"
# query a database
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg38 -e "SELECT * FROM geneid" > geneid.txt

Play with SQLite

  • Many bioinformatic annotations can be directly accessed from bioconductor packages in SQLite format