Play with Genomic Features
- Genomic data is often genomic ranges associate with some metadata
- gtf, bed, vcf, sam, wig, bedgraph …
- Multiple tools is available for manipulate such genomic ranges
A closer look at file formats
- The coordinate system
- gtf / gff file
- bed / bedgraph and bed12 format
- bed: genome interval associate with some metadata
- bed12: a specific bed, each line specify the structure of a transcript
# exampled bed12 record
1 2 3 4 5 6 7 8 9 10 11 12
chr1 11868 14409 ENST00000456328.2 100 + 11868 11868 0 3 359,109,1189, 0,744,1352,
# field 12 is block start, this coordinate is always relative to left most in the genome
# for tx in forward strand, that is transcript start
# for tx in reverse strand, that is transcript end
- vcf format
- wiggle and bigwig
- sam
Parse / Load different file format & file format conversion
- R / bioconductor, https://bioconductor.org/developers/how-to/commonImportsAndClasses/
- rtracklayer support data importing in multiple format
- “In summary, for the typical use case of combining gene models with experimental data, GFF is preferred for gene models and
BigWigis preferred for quantitative score vectors. “
- “In summary, for the typical use case of combining gene models with experimental data, GFF is preferred for gene models and
- https://kasperdanielhansen.github.io/genbioconductor/html/rtracklayer_Import.html
- GenomicFeatures is useful for manipulate genome annotation
- rtracklayer support data importing in multiple format
library(rtracklayer)
tx <- rtracklayer::import(file.bed12,"bed")
library(GenomicFeatures)
txdb <- makeTxDbFromGFF(file=gtf.path,format="gtf")
# Output:
#Import genomic features from the file as a GRanges object ... OK
#Prepare the 'metadata' data frame ... OK
#Make the TxDb object ... The "phase" metadata column contains non-NA values for features of type stop_codon.
# Get exons by tx
# Return a list of GRanges
tx.exons <- exonsBy(txdb, by="tx")
# information was ignored.OK
Get transcript from genome with bed12/gtf gene model
-
Command line tools
-
http://ccb.jhu.edu/software/stringtie/dl/gffread-0.11.4.Linux_x86_64.tar.gz
-
Get transcripts fasta from genome fasta
-
#gffread <input_gff> [-g <genomic_seqs_fasta> | <dir>][-s <seq_info.fsize>] [-o <outfile>] [-t <trackname>] [-r [[<strand>]<chr>:]<start>..<end> [-R]][-CTVNJMKQAFPGUBHZWTOLE] [-w <exons.fa>] [-x <cds.fa>] [-y <tr_cds.fa>][-i <maxintron>] [--bed] [--table <attrlist>] [--sort-by <refseq_list.txt>]
gffread -g genome.fa -s genome.size -W -M -F -G -A -O -E -w transcriptome.fa -d transcriptome.collapsed.info genome.gtf
# -W for -w and -x options, also write for each fasta record the exon coordinates projected onto the spliced sequence
# -M/--merge : cluster the input transcripts into loci, collapsing matching transcripts (those with the same exact introns and fully contained)
# -F preserve all GFF attributes (for non-exon features)
# -G do not keep exon attributes, move them to the transcript feature (for GFF3 output)
# -A use the description field from <seq_info.fsize> and add it as the value for a 'descr' attribute to the GFF record
# -O process also non-transcript GFF records (by default non-transcript records are ignored)
# -E expose (warn about) duplicate transcript IDs and other potential problems with the given GFF/GTF records
# -w write a fasta file with spliced exons for each GFF transcript
- Convert gff to bed12
gffread -W -M -F -G -A -E --bed {gtf} > {bed}
-
Get fasta from bed12 file
- The result is strange, need check (seems intron is not removed)
bedtools getfasta -name+ -fi genome.fasta -bed {bed12} -split -s > tx.fa
-
Bioconductor packages
- This implementation is quite slow, seems can be improved
library(Biostrings)
library(GenomicFeatures)
library(pbapply)
# Load gene model from gtf file
txdb <- makeTxDbFromGFF(file=gtf.path,format="gtf")
tx.exons <- exonsBy(txdb, by="tx")
# Load genome sequence from fasta file
fasta.path <- "Path to genomic fasta file"
genome.bs <- Biostrings::readDNAStringSet(fasta.path)
# Extract tx sequence
getTxSequence <- function(tx.grs,genome.dna.set){
chrom <- seqnames(tx.grs)[1]
strand <- strand(tx.grs)[1]
dna <- genome.dna.set[[chrom]]
tx.name <- gsub(".exon.*$","",tx.grs$exon_name[1])
tx.irs <- IRangesList(ranges(tx.grs))
names(tx.irs) <- tx.name
tx.seq <- GenomicFeatures::extractTranscriptSeqs(dna,transcripts=tx.irs,strand=strand)
tx.seq
}
tx.sequences <- unlist(pblapply(tx.exons,100,getTxSequence,genome.dna.set=genome.bs))
names(tx.sequences) <- NULL
tx.sequences <- do.call(c,tx.sequences)
Biostrings::writeXStringSet(tx.sequences,"tx.fasta")
-
Perl scripts
- Check this https://metacpan.org/release/Bio-ViennaNGS, seems not work for spliced gene
Liftover between different version of genome
Projection between genome coordinate and transcript coordinate
-
You have a list of genome interval, and gene model, you want to project the interval to the coordinate of transcript
-
Bioconductor implementation
-
GenomicFeatures package
-
mapToTranscripts: map genome coordinate to tx coordinate mapFromTranscripts: map transcript coordinate.- note that if the a input genomic interval spans two exons, it will output a genomic interval contains a intron
- may also take a look at https://github.com/uaauaguga/NGS-Analysis-Notes/blob/master/scripts/genome-crd2tx-crd.py (still need some test…)
transcriptLocs2refLocs
-
-
-
Get coordiante relative to a single transcript
library(GenomicFeatures)
gtf.path <- "annotation/gene-models/gtf/Arabidopsis_thaliana.TAIR10.46.gtf"
txdb <- makeTxDbFromGFF(file=gtf.path,format="gtf")
tx.exons <- exonsBy(txdb, by="tx",use.names=TRUE) # use.names tells the function use tx name as key of the list
tx.used <- tx.exons["AT1G01010.1"]
gr <- GRanges(c("1","1","2","2"),IRanges(c(4000,4160,2000,3000), width=100),c("+","+","-","+"))
names(gr) <- rep("AT1G01010.1",length(gr))
mapToTranscripts(gr,tx.used)
- Map genomic intervals from genome coordiante to transcript coordinate
#!/usr/bin/env Rscript
library(GenomicFeatures)
library(rtracklayer)
peaks <- rtracklayer::import("bed/AtGRP7.x.sites.21nt.merged.bed","bed")
txdb <- makeTxDbFromGFF(file="genome/gff/tair10.gff",format="gff")
peaks.tx <- mapToTranscripts(peaks,txdb)
tx.info <- select(txdb,keys =as.character(seqnames(peaks.tx)),columns=c("TXNAME"),keytype="TXID")
mcols(peaks.tx)[["name"]] <- unlist(tx.info$TXNAME)
export.bed(peaks.tx, con = 'peaks.tx.bed' )
Get CDS relative to transcript coordinate
library(GenomicFeatures)
gtf.path <- "annotation/gene-models/gtf/Arabidopsis_thaliana.TAIR10.46.gtf"
txdb <- makeTxDbFromGFF(file=gtf.path,format="gtf")
tx.utr5p <- fiveUTRsByTranscript(txdb,use.names=TRUE)
tx.cds <- cdsBy(txdb,by="tx",use.names=TRUE)
utr5p.lengths <- sum(width(tx.utr5p))
cds.lengths <- sum(width(tx.cds))
utr5p.add <- rep(0,length(tx.no5putr))
tx.no5putr <- setdiff(names(cds.lengths),names(utr5p.lengths))
names(utr5p.add) <- tx.no5putr
utr5p.lengths <- c(utr5p.lengths,utr5p.add)
utr5p.lengths <- utr5p.lengths[names(cds.lengths)]
tx.coordinates <- data.frame(utr5p.lengths,cds.lengths)
tx.coordinates[["utr5p-start"]] <- 0
tx.coordinates[["utr5p-end"]] <- tx.coordinates[["utr5p.lengths"]]
tx.coordinates[["cds-end"]] <- tx.coordinates[["utr5p.lengths"]] + tx.coordinates[["cds.lengths"]]
utr5p <- tx.coordinates[,c("utr5p-start","utr5p-end")]
utr5p <- utr5p[utr5p[["utr5p-end"]]>0,]
write.table(utr5p,file="5putr.bed",sep="\t",quote = F, col.names = F)
write.table(tx.coordinates[,c("utr5p-end","cds-end")],file="cds.bed",sep="\t",quote = F, col.names = F)
Useful tools and resource
-
Useful tools
-
Useful links
- Manipulate genomic interval
- https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GRanges_and_GRangesList_slides.pdf
- http://bioconductor.org/help/course-materials/2010/EMBL2010/GenomicRanges.pdf
- http://bioconductor.org/packages/release/bioc/vignettes/HelloRanges/inst/doc/tutorial.pdf
- http://bioconductor.org/packages/devel/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf
- http://bioconductor.org/help/course-materials/2014/CSAMA2014/2_Tuesday/lectures/Ranges_Sequences_and_Alignments-Lawrence.pdf
- https://www.bioconductor.org/help/course-materials/2016/BioC2016/ConcurrentWorkshops1/Obenchain/Lecture-sequences.html
- Manipulate genomic interval
Some task related to interval comparison
Take intersection of feature sets
- Annotate input features with known feature set
bedtools annotatebedtools intersect
- Count coverage
bedtools coverage -a features.bed -b input.bam -counts > counts.txt
- Take some arithmetic operation across genomic interval
# -o can be arithmetic operation like sum, min, max, mean, median ...
bedtools map -a interval.tosummarize.bed -b input.value.bed -c 4 -o sum > output.bed
Take union
- Merge bed file
# Consider strandness
sort -k1,1 -k2,2n ${input} | bedtools merge -s -c 6 -o distinct | awk 'BEGIN{OFS="\t";}{print $1,$2,$3,".",".",$4}' > ${output}
-
Merge / reduce exon from different isoforms from a gene
- Get gene length for TPM or FPKM calculation (If you use tools like featureCount, gene length information is provided in the output)
#!/usr/bin/env Rscript # Refer to https://www.biostars.org/p/83901/ library(GenomicFeatures) gtf.path <- "annotation/gene-models/gtf/Arabidopsis_thaliana.TAIR10.46.gtf" txdb <- makeTxDbFromGFF(file=gtf.path,format="gtf") gene.exons <- exonsBy(txdb, by="gene") # The key operation is GenomicRanges::reduce gene.lengths <- sum(GenomicRanges::width(GenomicRanges::reduce(gene.exons))) write.table(gene.lengths,"gene.length.txt",col.name=F,quote=F,sep="\t")
- Get gene length for TPM or FPKM calculation (If you use tools like featureCount, gene length information is provided in the output)
Take difference
- Get intron location from gff/gtf file
bedtools subtract
Algorithm behind these tools
- Interval tree
- Nested Containment List
Implement similar function in other programming language
- perl
- Set::IntervalTree
- Only implement interval tree, should manually handle different chromosomes and strandness
- A example on without considering chromosomes and strandness
#!/usr/bin/env perl
use Set::IntervalTree;
use Getopt::Long;
GetOptions ("database=s" => \my $db_path,
"query=s" => \my $query_path)
or die("Error in parsing command line arguments\n");
open(FDB,"<",$db_path) or die("Failed opening database file\n");
open(FQUERY,"<",$query_path) or die("Failed opening query file\n");
my $treedb = Set::IntervalTree->new;
#print "Load database file ...\n";
my $chr,$start,$end; #,$name;
while(<FDB>){
chomp;
@fields = split "\t",$_;
$chr = $fields[0];
$start = $fields[1];
$end = $fields[2];
$treedb->insert($_."",$start,$end);
}
#print "Done .\n";
while(<FQUERY>){
chomp;
@fields = split "\t",$_;
$start = $fields[1];
$end = $fields[2];
$entries = $treedb->fetch($start,$end);
for my $entry (@$entries){
print $entry,"\n";
}
}
close(FDB);
close(FQUERY);
- python
- There are also interval tree implementations in python intervaltree, not try yet
- Seems pyranges is a good alternative in python
- bx-python
- C++