Play with Genomic Features

Genomic data is often genomic ranges associate with some metadata
gtf, bed, vcf, sam, wig, bedgraph …
Multiple tools is available for manipulate such genomic ranges

A closer look at file formats

The coordinate system
gtf / gff file
bed / bedgraph and bed12 format
- bed: genome interval associate with some metadata
- bed12: a specific bed, each line specify the structure of a transcript

# exampled bed12 record
  1      2     3           4         5  6   7     8   9 10     11           12
chr1	11868	14409	ENST00000456328.2	100	+	11868	11868	0	3	359,109,1189,	0,744,1352,

# field 12 is block start, this coordinate is always relative to left most in the genome
# for tx in forward strand, that is transcript start
# for tx in reverse strand, that is transcript end

vcf format
wiggle and bigwig
sam

Parse / Load different file format & file format conversion

R / bioconductor, https://bioconductor.org/developers/how-to/commonImportsAndClasses/
- rtracklayer support data importing in multiple format
  - “In summary, for the typical use case of combining gene models with experimental data, GFF is preferred for gene models and BigWig is preferred for quantitative score vectors. “
- https://kasperdanielhansen.github.io/genbioconductor/html/rtracklayer_Import.html
- GenomicFeatures is useful for manipulate genome annotation

  library(rtracklayer)
  tx <- rtracklayer::import(file.bed12,"bed")

  library(GenomicFeatures)
  txdb <- makeTxDbFromGFF(file=gtf.path,format="gtf")
  # Output:
  #Import genomic features from the file as a GRanges object ... OK
  #Prepare the 'metadata' data frame ... OK
  #Make the TxDb object ... The "phase" metadata column contains non-NA values for features of type stop_codon. 

  # Get exons by tx
  # Return a list of GRanges
  tx.exons <- exonsBy(txdb, by="tx")
  #  information was ignored.OK

Get transcript from genome with bed12/gtf gene model

Command line tools
- http://ccb.jhu.edu/software/stringtie/dl/gffread-0.11.4.Linux_x86_64.tar.gz
- Get transcripts fasta from genome fasta

    #gffread <input_gff> [-g <genomic_seqs_fasta> | <dir>][-s <seq_info.fsize>] [-o <outfile>] [-t <trackname>] [-r [[<strand>]<chr>:]<start>..<end> [-R]][-CTVNJMKQAFPGUBHZWTOLE] [-w <exons.fa>] [-x <cds.fa>] [-y <tr_cds.fa>][-i <maxintron>] [--bed] [--table <attrlist>] [--sort-by <refseq_list.txt>]
    gffread -g genome.fa -s genome.size -W -M -F -G -A -O -E -w transcriptome.fa -d transcriptome.collapsed.info genome.gtf
    #  -W  for -w and -x options, also write for each fasta record the exon coordinates projected onto the spliced sequence
    #  -M/--merge : cluster the input transcripts into loci, collapsing matching transcripts (those with the same exact introns and fully contained)
    # -F  preserve all GFF attributes (for non-exon features)
    # -G  do not keep exon attributes, move them to the transcript feature (for GFF3 output)
    # -A  use the description field from <seq_info.fsize> and add it as the value for a 'descr' attribute to the GFF record
    # -O  process also non-transcript GFF records (by default non-transcript records are ignored)
    # -E  expose (warn about) duplicate transcript IDs and other potential problems with the given GFF/GTF records
    #  -w  write a fasta file with spliced exons for each GFF transcript

Convert gff to bed12

    gffread -W -M -F -G -A -E --bed {gtf} > {bed}

Get fasta from bed12 file
- The result is strange, need check (seems intron is not removed)

    bedtools getfasta -name+ -fi genome.fasta -bed {bed12} -split -s > tx.fa

Bioconductor packages
- This implementation is quite slow, seems can be improved

  library(Biostrings)
  library(GenomicFeatures)
  library(pbapply)
  
  # Load gene model from gtf file
  txdb <- makeTxDbFromGFF(file=gtf.path,format="gtf")
  tx.exons <- exonsBy(txdb, by="tx")
  
  # Load genome sequence from fasta file
  fasta.path <- "Path to genomic fasta file"
  genome.bs <- Biostrings::readDNAStringSet(fasta.path)
  
  # Extract tx sequence
  getTxSequence <- function(tx.grs,genome.dna.set){
  chrom <- seqnames(tx.grs)[1]
  strand <- strand(tx.grs)[1]
  dna <- genome.dna.set[[chrom]]
  tx.name <- gsub(".exon.*$","",tx.grs$exon_name[1])
  tx.irs <- IRangesList(ranges(tx.grs))
  names(tx.irs) <- tx.name
  tx.seq <-   GenomicFeatures::extractTranscriptSeqs(dna,transcripts=tx.irs,strand=strand)
  tx.seq
  }
  
  tx.sequences <- unlist(pblapply(tx.exons,100,getTxSequence,genome.dna.set=genome.bs))
  names(tx.sequences) <- NULL
  tx.sequences <- do.call(c,tx.sequences)
  Biostrings::writeXStringSet(tx.sequences,"tx.fasta")

Perl scripts
- Check this https://metacpan.org/release/Bio-ViennaNGS, seems not work for spliced gene

Liftover between different version of genome

Projection between genome coordinate and transcript coordinate

You have a list of genome interval, and gene model, you want to project the interval to the coordinate of transcript
Bioconductor implementation
- GenomicFeatures package
  - mapToTranscripts: map genome coordinate to tx coordinate
  - mapFromTranscripts: map transcript coordinate.
    - note that if the a input genomic interval spans two exons, it will output a genomic interval contains a intron
    - may also take a look at https://github.com/uaauaguga/NGS-Analysis-Notes/blob/master/scripts/genome-crd2tx-crd.py (still need some test…)
  - transcriptLocs2refLocs
Get coordiante relative to a single transcript

    library(GenomicFeatures)
    gtf.path <- "annotation/gene-models/gtf/Arabidopsis_thaliana.TAIR10.46.gtf"
    txdb <- makeTxDbFromGFF(file=gtf.path,format="gtf")
    tx.exons <- exonsBy(txdb, by="tx",use.names=TRUE) # use.names tells the function use tx name as key of the list
    tx.used <- tx.exons["AT1G01010.1"]
    gr <- GRanges(c("1","1","2","2"),IRanges(c(4000,4160,2000,3000), width=100),c("+","+","-","+"))
    names(gr) <- rep("AT1G01010.1",length(gr))
    mapToTranscripts(gr,tx.used)

Map genomic intervals from genome coordiante to transcript coordinate

#!/usr/bin/env Rscript
library(GenomicFeatures)
library(rtracklayer)
peaks <- rtracklayer::import("bed/AtGRP7.x.sites.21nt.merged.bed","bed")
txdb <- makeTxDbFromGFF(file="genome/gff/tair10.gff",format="gff")
peaks.tx <- mapToTranscripts(peaks,txdb)
tx.info <- select(txdb,keys =as.character(seqnames(peaks.tx)),columns=c("TXNAME"),keytype="TXID")
mcols(peaks.tx)[["name"]] <- unlist(tx.info$TXNAME)
export.bed(peaks.tx, con = 'peaks.tx.bed' )

Get CDS relative to transcript coordinate

library(GenomicFeatures)
gtf.path <- "annotation/gene-models/gtf/Arabidopsis_thaliana.TAIR10.46.gtf"
txdb <- makeTxDbFromGFF(file=gtf.path,format="gtf")
tx.utr5p <- fiveUTRsByTranscript(txdb,use.names=TRUE)
tx.cds <- cdsBy(txdb,by="tx",use.names=TRUE)
utr5p.lengths <- sum(width(tx.utr5p))
cds.lengths <- sum(width(tx.cds))
utr5p.add <- rep(0,length(tx.no5putr))
tx.no5putr <- setdiff(names(cds.lengths),names(utr5p.lengths))
names(utr5p.add) <- tx.no5putr
utr5p.lengths <- c(utr5p.lengths,utr5p.add)
utr5p.lengths <- utr5p.lengths[names(cds.lengths)]
tx.coordinates <- data.frame(utr5p.lengths,cds.lengths)
tx.coordinates[["utr5p-start"]] <- 0
tx.coordinates[["utr5p-end"]] <- tx.coordinates[["utr5p.lengths"]]
tx.coordinates[["cds-end"]] <- tx.coordinates[["utr5p.lengths"]] + tx.coordinates[["cds.lengths"]]
utr5p <- tx.coordinates[,c("utr5p-start","utr5p-end")]
utr5p <- utr5p[utr5p[["utr5p-end"]]>0,]
write.table(utr5p,file="5putr.bed",sep="\t",quote = F, col.names = F)
write.table(tx.coordinates[,c("utr5p-end","cds-end")],file="cds.bed",sep="\t",quote = F, col.names = F)

Useful tools and resource

Take intersection of feature sets

Annotate input features with known feature set
- bedtools annotate
- bedtools intersect
Count coverage

bedtools coverage -a features.bed -b input.bam -counts > counts.txt

Take some arithmetic operation across genomic interval

# -o can be arithmetic operation like sum, min, max, mean, median ...
bedtools map -a interval.tosummarize.bed -b input.value.bed  -c 4 -o sum > output.bed

Take union

Merge bed file

# Consider strandness
sort -k1,1 -k2,2n ${input} | bedtools merge -s -c 6 -o distinct | awk 'BEGIN{OFS="\t";}{print $1,$2,$3,".",".",$4}' > ${output}

Merge / reduce exon from different isoforms from a gene

Get gene length for TPM or FPKM calculation (If you use tools like featureCount, gene length information is provided in the output)

#!/usr/bin/env Rscript
# Refer to https://www.biostars.org/p/83901/
library(GenomicFeatures)
gtf.path <- "annotation/gene-models/gtf/Arabidopsis_thaliana.TAIR10.46.gtf"
txdb <- makeTxDbFromGFF(file=gtf.path,format="gtf")
gene.exons <- exonsBy(txdb, by="gene")
# The key operation is GenomicRanges::reduce
gene.lengths <- sum(GenomicRanges::width(GenomicRanges::reduce(gene.exons)))
write.table(gene.lengths,"gene.length.txt",col.name=F,quote=F,sep="\t")

Take difference

Get intron location from gff/gtf file
bedtools subtract

Algorithm behind these tools

Interval tree
Nested Containment List

Implement similar function in other programming language

perl
- Set::IntervalTree
- Only implement interval tree, should manually handle different chromosomes and strandness
- A example on without considering chromosomes and strandness

#!/usr/bin/env perl
use Set::IntervalTree;
use Getopt::Long;

GetOptions ("database=s" => \my $db_path,   
            "query=s"   => \my $query_path)      
or die("Error in parsing command line arguments\n");

open(FDB,"<",$db_path) or die("Failed opening database file\n");
open(FQUERY,"<",$query_path) or die("Failed opening query file\n");


my $treedb = Set::IntervalTree->new;


#print "Load database file ...\n";
my $chr,$start,$end; #,$name;
while(<FDB>){
  chomp;
  @fields = split "\t",$_;
  $chr = $fields[0];
  $start = $fields[1];
  $end = $fields[2];
  $treedb->insert($_."",$start,$end);
}
#print "Done .\n";


while(<FQUERY>){
  chomp;
  @fields = split "\t",$_;
  $start = $fields[1];
  $end = $fields[2];
  $entries = $treedb->fetch($start,$end);
  for my $entry (@$entries){
    print $entry,"\n";
  }
}

close(FDB);
close(FQUERY);

python
- There are also interval tree implementations in python intervaltree, not try yet
- Seems pyranges is a good alternative in python
- bx-python
C++
- cgranges

A closer look at file formats

Parse / Load different file format & file format conversion

Get transcript from genome with bed12/gtf gene model

Liftover between different version of genome

Projection between genome coordinate and transcript coordinate

Get CDS relative to transcript coordinate

Useful tools and resource

Some task related to interval comparison

Take intersection of feature sets

Take union

Take difference

Algorithm behind these tools

Implement similar function in other programming language