Play with Genomic Features
- Genomic data is often genomic ranges associate with some metadata
 - gtf, bed, vcf, sam, wig, bedgraph …
 - Multiple tools is available for manipulate such genomic ranges
 
A closer look at file formats
- The coordinate system
 - gtf / gff file
 - bed / bedgraph and bed12 format
    
- bed: genome interval associate with some metadata
 - bed12: a specific bed, each line specify the structure of a transcript
 
 
# exampled bed12 record
  1      2     3           4         5  6   7     8   9 10     11           12
chr1	11868	14409	ENST00000456328.2	100	+	11868	11868	0	3	359,109,1189,	0,744,1352,
# field 12 is block start, this coordinate is always relative to left most in the genome
# for tx in forward strand, that is transcript start
# for tx in reverse strand, that is transcript end
- vcf format
 - wiggle and bigwig
 - sam
 
Parse / Load different file format & file format conversion
- R / bioconductor, https://bioconductor.org/developers/how-to/commonImportsAndClasses/
    
- rtracklayer support data importing in multiple format
        
- “In summary, for the typical use case of combining gene models with experimental data, GFF is preferred for gene models and 
BigWigis preferred for quantitative score vectors. “ 
 - “In summary, for the typical use case of combining gene models with experimental data, GFF is preferred for gene models and 
 - https://kasperdanielhansen.github.io/genbioconductor/html/rtracklayer_Import.html
 - GenomicFeatures is useful for manipulate genome annotation
 
 - rtracklayer support data importing in multiple format
        
 
  library(rtracklayer)
  tx <- rtracklayer::import(file.bed12,"bed")
  library(GenomicFeatures)
  txdb <- makeTxDbFromGFF(file=gtf.path,format="gtf")
  # Output:
  #Import genomic features from the file as a GRanges object ... OK
  #Prepare the 'metadata' data frame ... OK
  #Make the TxDb object ... The "phase" metadata column contains non-NA values for features of type stop_codon. 
  # Get exons by tx
  # Return a list of GRanges
  tx.exons <- exonsBy(txdb, by="tx")
  #  information was ignored.OK
Get transcript from genome with bed12/gtf gene model
- 
    
Command line tools
- 
        
http://ccb.jhu.edu/software/stringtie/dl/gffread-0.11.4.Linux_x86_64.tar.gz
 - 
        
Get transcripts fasta from genome fasta
 
 - 
        
 
    #gffread <input_gff> [-g <genomic_seqs_fasta> | <dir>][-s <seq_info.fsize>] [-o <outfile>] [-t <trackname>] [-r [[<strand>]<chr>:]<start>..<end> [-R]][-CTVNJMKQAFPGUBHZWTOLE] [-w <exons.fa>] [-x <cds.fa>] [-y <tr_cds.fa>][-i <maxintron>] [--bed] [--table <attrlist>] [--sort-by <refseq_list.txt>]
    gffread -g genome.fa -s genome.size -W -M -F -G -A -O -E -w transcriptome.fa -d transcriptome.collapsed.info genome.gtf
    #  -W  for -w and -x options, also write for each fasta record the exon coordinates projected onto the spliced sequence
    #  -M/--merge : cluster the input transcripts into loci, collapsing matching transcripts (those with the same exact introns and fully contained)
    # -F  preserve all GFF attributes (for non-exon features)
    # -G  do not keep exon attributes, move them to the transcript feature (for GFF3 output)
    # -A  use the description field from <seq_info.fsize> and add it as the value for a 'descr' attribute to the GFF record
    # -O  process also non-transcript GFF records (by default non-transcript records are ignored)
    # -E  expose (warn about) duplicate transcript IDs and other potential problems with the given GFF/GTF records
    #  -w  write a fasta file with spliced exons for each GFF transcript
- Convert gff to bed12
 
    gffread -W -M -F -G -A -E --bed {gtf} > {bed}
- 
    
Get fasta from bed12 file
- The result is strange, need check (seems intron is not removed)
 
 
    bedtools getfasta -name+ -fi genome.fasta -bed {bed12} -split -s > tx.fa
- 
    
Bioconductor packages
- This implementation is quite slow, seems can be improved
 
 
  library(Biostrings)
  library(GenomicFeatures)
  library(pbapply)
  
  # Load gene model from gtf file
  txdb <- makeTxDbFromGFF(file=gtf.path,format="gtf")
  tx.exons <- exonsBy(txdb, by="tx")
  
  # Load genome sequence from fasta file
  fasta.path <- "Path to genomic fasta file"
  genome.bs <- Biostrings::readDNAStringSet(fasta.path)
  
  # Extract tx sequence
  getTxSequence <- function(tx.grs,genome.dna.set){
  chrom <- seqnames(tx.grs)[1]
  strand <- strand(tx.grs)[1]
  dna <- genome.dna.set[[chrom]]
  tx.name <- gsub(".exon.*$","",tx.grs$exon_name[1])
  tx.irs <- IRangesList(ranges(tx.grs))
  names(tx.irs) <- tx.name
  tx.seq <-   GenomicFeatures::extractTranscriptSeqs(dna,transcripts=tx.irs,strand=strand)
  tx.seq
  }
  
  tx.sequences <- unlist(pblapply(tx.exons,100,getTxSequence,genome.dna.set=genome.bs))
  names(tx.sequences) <- NULL
  tx.sequences <- do.call(c,tx.sequences)
  Biostrings::writeXStringSet(tx.sequences,"tx.fasta")
- 
    
Perl scripts
- Check this https://metacpan.org/release/Bio-ViennaNGS, seems not work for spliced gene
 
 
Liftover between different version of genome
Projection between genome coordinate and transcript coordinate
- 
    
You have a list of genome interval, and gene model, you want to project the interval to the coordinate of transcript
 - 
    
Bioconductor implementation
- 
        
GenomicFeatures package
- 
            
mapToTranscripts: map genome coordinate to tx coordinate mapFromTranscripts: map transcript coordinate.- note that if the a input genomic interval spans two exons, it will output a genomic interval contains a intron
 - may also take a look at https://github.com/uaauaguga/NGS-Analysis-Notes/blob/master/scripts/genome-crd2tx-crd.py (still need some test…)
 
transcriptLocs2refLocs
 - 
            
 
 - 
        
 - 
    
Get coordiante relative to a single transcript
 
    library(GenomicFeatures)
    gtf.path <- "annotation/gene-models/gtf/Arabidopsis_thaliana.TAIR10.46.gtf"
    txdb <- makeTxDbFromGFF(file=gtf.path,format="gtf")
    tx.exons <- exonsBy(txdb, by="tx",use.names=TRUE) # use.names tells the function use tx name as key of the list
    tx.used <- tx.exons["AT1G01010.1"]
    gr <- GRanges(c("1","1","2","2"),IRanges(c(4000,4160,2000,3000), width=100),c("+","+","-","+"))
    names(gr) <- rep("AT1G01010.1",length(gr))
    mapToTranscripts(gr,tx.used)
- Map genomic intervals from genome coordiante to transcript coordinate
 
#!/usr/bin/env Rscript
library(GenomicFeatures)
library(rtracklayer)
peaks <- rtracklayer::import("bed/AtGRP7.x.sites.21nt.merged.bed","bed")
txdb <- makeTxDbFromGFF(file="genome/gff/tair10.gff",format="gff")
peaks.tx <- mapToTranscripts(peaks,txdb)
tx.info <- select(txdb,keys =as.character(seqnames(peaks.tx)),columns=c("TXNAME"),keytype="TXID")
mcols(peaks.tx)[["name"]] <- unlist(tx.info$TXNAME)
export.bed(peaks.tx, con = 'peaks.tx.bed' )
Get CDS relative to transcript coordinate
library(GenomicFeatures)
gtf.path <- "annotation/gene-models/gtf/Arabidopsis_thaliana.TAIR10.46.gtf"
txdb <- makeTxDbFromGFF(file=gtf.path,format="gtf")
tx.utr5p <- fiveUTRsByTranscript(txdb,use.names=TRUE)
tx.cds <- cdsBy(txdb,by="tx",use.names=TRUE)
utr5p.lengths <- sum(width(tx.utr5p))
cds.lengths <- sum(width(tx.cds))
utr5p.add <- rep(0,length(tx.no5putr))
tx.no5putr <- setdiff(names(cds.lengths),names(utr5p.lengths))
names(utr5p.add) <- tx.no5putr
utr5p.lengths <- c(utr5p.lengths,utr5p.add)
utr5p.lengths <- utr5p.lengths[names(cds.lengths)]
tx.coordinates <- data.frame(utr5p.lengths,cds.lengths)
tx.coordinates[["utr5p-start"]] <- 0
tx.coordinates[["utr5p-end"]] <- tx.coordinates[["utr5p.lengths"]]
tx.coordinates[["cds-end"]] <- tx.coordinates[["utr5p.lengths"]] + tx.coordinates[["cds.lengths"]]
utr5p <- tx.coordinates[,c("utr5p-start","utr5p-end")]
utr5p <- utr5p[utr5p[["utr5p-end"]]>0,]
write.table(utr5p,file="5putr.bed",sep="\t",quote = F, col.names = F)
write.table(tx.coordinates[,c("utr5p-end","cds-end")],file="cds.bed",sep="\t",quote = F, col.names = F)
Useful tools and resource
- 
    
Useful tools
 - 
    
Useful links
- Manipulate genomic interval
        
- https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GRanges_and_GRangesList_slides.pdf
 - http://bioconductor.org/help/course-materials/2010/EMBL2010/GenomicRanges.pdf
 - http://bioconductor.org/packages/release/bioc/vignettes/HelloRanges/inst/doc/tutorial.pdf
 - http://bioconductor.org/packages/devel/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf
 - http://bioconductor.org/help/course-materials/2014/CSAMA2014/2_Tuesday/lectures/Ranges_Sequences_and_Alignments-Lawrence.pdf
 - https://www.bioconductor.org/help/course-materials/2016/BioC2016/ConcurrentWorkshops1/Obenchain/Lecture-sequences.html
 
 
 - Manipulate genomic interval
        
 
Some task related to interval comparison
Take intersection of feature sets
- Annotate input features with known feature set
    
bedtools annotatebedtools intersect
 - Count coverage
 
bedtools coverage -a features.bed -b input.bam -counts > counts.txt
- Take some arithmetic operation across genomic interval
 
# -o can be arithmetic operation like sum, min, max, mean, median ...
bedtools map -a interval.tosummarize.bed -b input.value.bed  -c 4 -o sum > output.bed
Take union
- Merge bed file
 
# Consider strandness
sort -k1,1 -k2,2n ${input} | bedtools merge -s -c 6 -o distinct | awk 'BEGIN{OFS="\t";}{print $1,$2,$3,".",".",$4}' > ${output}
- 
    
Merge / reduce exon from different isoforms from a gene
- Get gene length for TPM or FPKM calculation (If you use tools like featureCount, gene length information is provided in the output)
        
#!/usr/bin/env Rscript # Refer to https://www.biostars.org/p/83901/ library(GenomicFeatures) gtf.path <- "annotation/gene-models/gtf/Arabidopsis_thaliana.TAIR10.46.gtf" txdb <- makeTxDbFromGFF(file=gtf.path,format="gtf") gene.exons <- exonsBy(txdb, by="gene") # The key operation is GenomicRanges::reduce gene.lengths <- sum(GenomicRanges::width(GenomicRanges::reduce(gene.exons))) write.table(gene.lengths,"gene.length.txt",col.name=F,quote=F,sep="\t") 
 - Get gene length for TPM or FPKM calculation (If you use tools like featureCount, gene length information is provided in the output)
        
 
Take difference
- Get intron location from gff/gtf file
 bedtools subtract
Algorithm behind these tools
- Interval tree
 - Nested Containment List
 
Implement similar function in other programming language
- perl
    
- Set::IntervalTree
 - Only implement interval tree, should manually handle different chromosomes and strandness
 - A example on without considering chromosomes and strandness
 
 
#!/usr/bin/env perl
use Set::IntervalTree;
use Getopt::Long;
GetOptions ("database=s" => \my $db_path,   
            "query=s"   => \my $query_path)      
or die("Error in parsing command line arguments\n");
open(FDB,"<",$db_path) or die("Failed opening database file\n");
open(FQUERY,"<",$query_path) or die("Failed opening query file\n");
my $treedb = Set::IntervalTree->new;
#print "Load database file ...\n";
my $chr,$start,$end; #,$name;
while(<FDB>){
  chomp;
  @fields = split "\t",$_;
  $chr = $fields[0];
  $start = $fields[1];
  $end = $fields[2];
  $treedb->insert($_."",$start,$end);
}
#print "Done .\n";
while(<FQUERY>){
  chomp;
  @fields = split "\t",$_;
  $start = $fields[1];
  $end = $fields[2];
  $entries = $treedb->fetch($start,$end);
  for my $entry (@$entries){
    print $entry,"\n";
  }
}
close(FDB);
close(FQUERY);
- python
    
- There are also interval tree implementations in python intervaltree, not try yet
 - Seems pyranges is a good alternative in python
 - bx-python
 
 - C++