Aligners and Output Formatting
- This post tries to record what I’ve heard of about the following questions.
    
- Use which aligner, which parameter, to what reference sequence, to map what data
 - How to manipulate output of different aligners, or equivalently, how to manipulate bam files
 
 
Famous tools for pairwise sequence comparison
- Short reads aligner
 - Long reads aligner
 - General propose aligner
    
- blast
 - blat
 
 - Whole genome aligner
 
Short reads mapping: reference and parameters
Reads preprocessing
Adaptor trimming / quality trimming
- If you perform global alignment, then adapter trimming is required
    
Duplication removal / UMI handling
 - You may remove PCR duplicates before alignment or after alignment
    
- Pros: reduce computational burden in reads mapping.
 - Cons: less flexible
        
- post alignment duplication removal allows “mark duplicates” instead of “remove duplicates”
 
 
 
Short read aligners
- 
    
What short read aligners do is actually assign some annotations to each reads, that is which segment of the reads are assigned to which location of the genome in which form
 - 
    
In different aligner, same thing may have different name
 
Map to genome / map to transcriptome
- Seems most project prefer map reads to genome, but some downstream analysis tools (salmon, rsem) requires the reads to be aligned to transcriptome coordinate
 - STAR could direct project genome aligned reads to transcriptome coordinate
 
Which sequence to add in reference
- Impact of haplotype and decoy sequences
 
Gapped alignment / Ungapped alignment
- Whether the aligner allow deletions and insertions
 - The CIGAR string in bam file contains 
IandDoperation - Most aligners support gapped alignments, but note bowtie1 only support ungapped, so its output cannot be used for indel calling
 
DNA-seq alignment / RNA-seq alignment
- The major difference between DNA-seq mapping and RNA-seq mapping is RNA-seq aligner perform spliced alignment
 - For spliced aware aligner, the output bam file contains 
Nin cigar operation - Spliced-aware aligner usually accept genome annotation to define known splice junction
 - The mapping quality calculation in DNA-seq aligner and RNA-seq aligner can be different, as RNA-seq is seldom used for variant calling
 
Local alignment / End to end alignment
- The scoring is different
 - End to end alignment requires both end of the reads to align to the genome, while local alignment only requires a substring of reads to align to genome
 - Local alignment is more sensitive, but may lead to false hit
 - Local alignment produce so called soft-clipped segments if ends of the reads is not aligned, that is 
Scigar operation - If you don’t trim adaptor, you have to use local alignment, but for short reads (small RNA sequencing or riboseq for example) it can be dangerous
 
Multimapper
- Secondary alignment
 - The problem of biased assignment??
 - The uniqueness of read mapping is measured by the mapping quality, see Li Heng’s discussion at Mapping Uniqueness
 - different aligner may use different MAPQ scoring …
 - Also see http://biofinysics.blogspot.com/2014/05/how-does-bowtie2-assign-mapq-scores.html, http://biofinysics.blogspot.com/2014/05/the-slow-death-of-term-uniquely.html and https://www.biostars.org/p/270090/
 
Chimeric alignment
- Split reads
 - Supplementary alignment
    
- Note supplementary alignment and secondary alignment are different things
 
 
Unmapped reads
Paired end reads specific situation
- Discordant reads
    
- See some discussion here https://www.biostars.org/p/278412/
 
 - Singleton
 - Unmapped reads
 
Some parameter specification for reference
- ENCODE RNA-seq
 
#https://github.com/ENCODE-DCC/long-rna-seq-pipeline/blob/master/dnanexus/rampage/rampage-align-pe/resources/usr/bin/rampage_align_star.sh
STAR --genomeDir out --readFilesIn $read1_fq_gz $read2_fq_gz                         \
    --readFilesCommand zcat --runThreadN $ncpus --genomeLoad NoSharedMemory           \
    --outFilterMultimapNmax 500 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1        \
    --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04                   \
    --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000              \
    --outSAMheaderCommentFile COfile.txt --outSAMheaderHD @HD VN:1.4 SO:coordinate        \
    --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD      \
    --outFilterScoreMinOverLread 0.85 --outFilterIntronMotifs RemoveNoncanonicalUnannotated \
    --clip5pNbases 6 15 --seedSearchStartLmax 30 --outSAMtype BAM SortedByCoordinate         \
    --limitBAMsortRAM ${ram_GB}000000000
- TCGA RNA-seq
 
# https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
### Step 1: Building the STAR index.*
STAR
--runMode genomeGenerate
--genomeDir <star_index_path>
--genomeFastaFiles <reference>
--sjdbOverhang 100
--sjdbGTFfile <gencode.v22.annotation.gtf>
--runThreadN 8
### Step 2: Alignment 1st Pass.
STAR
--genomeDir <star_index_path>
--readFilesIn <fastq_left_1>,<fastq_left2>,... <fastq_right_1>,<fastq_right_2>,...
--runThreadN <runThreadN>
--outFilterMultimapScoreRange 1
--outFilterMultimapNmax 20
--outFilterMismatchNmax 10
--alignIntronMax 500000
--alignMatesGapMax 1000000
--sjdbScore 2
--alignSJDBoverhangMin 1
--genomeLoad NoSharedMemory
--readFilesCommand <bzcat|cat|zcat>
--outFilterMatchNminOverLread 0.33
--outFilterScoreMinOverLread 0.33
--sjdbOverhang 100
--outSAMstrandField intronMotif
--outSAMtype None
--outSAMmode None
### Step 3: Intermediate Index Generation.
STAR
--runMode genomeGenerate
--genomeDir <output_path>
--genomeFastaFiles <reference>
--sjdbOverhang 100
--runThreadN <runThreadN>
--sjdbFileChrStartEnd <SJ.out.tab from previous step>
### Step 4: Alignment 2nd Pass.
STAR
--genomeDir <output_path from previous step>
--readFilesIn <fastq_left_1>,<fastq_left2>,... <fastq_right_1>,<fastq_right_2>,...
--runThreadN <runThreadN>
--outFilterMultimapScoreRange 1
--outFilterMultimapNmax 20
--outFilterMismatchNmax 10
--alignIntronMax 500000
--alignMatesGapMax 1000000
--sjdbScore 2
--alignSJDBoverhangMin 1
--genomeLoad NoSharedMemory
--limitBAMsortRAM 0
--readFilesCommand <bzcat|cat|zcat>
--outFilterMatchNminOverLread 0.33
--outFilterScoreMinOverLread 0.33
--sjdbOverhang 100
--outSAMstrandField intronMotif
--outSAMattributes NH HI NM MD AS XS
--outSAMunmapped Within
--outSAMtype BAM SortedByCoordinate
--outSAMheaderHD @HD VN:1.4
--outSAMattrRGline <formatted RG line provided by wrapper>
- TCGA DNA-seq
 
# https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/
bwa mem -t 8 -T 0 -R <read_group> <reference> <fastq_1.fq.gz> <fastq_2.fq.gz> | samtools view -Shb -o <output.bam> -
- arriba’s recommend parameter for fusion gene detection
 - 2021, Genome Research, Accurate and efficient detection of gene fusions from RNA sequencing data
 
#We ran STAR version 2.5.3a with the following parameters to align RNA-seq reads: 
STAR --outFilterMultimapNmax 1 --outFilterMismatchNmax 3 --outFilterMismatchNoverLmax 0.3 --alignIntronMax 500000 \
--alignMatesGapMax 500000 --chimSegmentMin 10 --chimJunctionOverhangMin 10 --chimScoreMin 1 --chimScoreDropMax 30 \
--chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 --alignSJstitchMismatchNmax 5 -1 5 5 --chimSegmentReadGapMax 3 \
--chimMainSegmentMultNmax 10
- STAR mapping for TE quantification in TEtranscript (handling multiple mapped reads)
    
- https://github.com/mhammell-laboratory/TEtranscripts/issues/69
 --outFilterMultimapNmax 100and--outAnchorMultimapNmax 100(alias –winAnchorMultimapNmax)- “we highly suggest users test multiple values of –winAnchorMultimapNmax to identify the optimal value for their experiment”
 
 
Alignment formats and APIs
Manipulate bam file
- Format specification of bam file
 - Flag in bam file
 - Unmapped reads
 
Duplication handling
- For different task, duplication should be ideally handled in different ways
    
- UMI is always favorable
 - Single end data and paired end data
 - In RNA-seq for gene expression quantification, better not remove duplicate is UMI is not present, see here for detail
 - For CLIP-seq, remember to collapse duplicate before peak calling
 
 - Tools
    
- Picard Markduplicates
 - samtools rmdup
        
- See following posts for differences between samtools rmdup and picard
            
- http://broadinstitute.github.io/picard/faq.html
                
- samtools rmdup do not remove interchromosomal duplicates
 
 - http://www.bio-info-trainee.com/2003.html
 
 - http://broadinstitute.github.io/picard/faq.html
                
 
 - See following posts for differences between samtools rmdup and picard
            
 - samblaster
 - sambamba