• This post tries to record what I’ve heard of about the following questions.
    • Use which aligner, which parameter, to what reference sequence, to map what data
    • How to manipulate output of different aligners, or equivalently, how to manipulate bam files

Famous tools for pairwise sequence comparison

Short reads mapping: reference and parameters

Reads preprocessing

Adaptor trimming / quality trimming

  • If you perform global alignment, then adapter trimming is required

    Duplication removal / UMI handling

  • You may remove PCR duplicates before alignment or after alignment
    • Pros: reduce computational burden in reads mapping.
    • Cons: less flexible
      • post alignment duplication removal allows “mark duplicates” instead of “remove duplicates”

Short read aligners

  • What short read aligners do is actually assign some annotations to each reads, that is which segment of the reads are assigned to which location of the genome in which form

  • In different aligner, same thing may have different name

Map to genome / map to transcriptome

  • Seems most project prefer map reads to genome, but some downstream analysis tools (salmon, rsem) requires the reads to be aligned to transcriptome coordinate
  • STAR could direct project genome aligned reads to transcriptome coordinate

Which sequence to add in reference

  • Impact of haplotype and decoy sequences

Gapped alignment / Ungapped alignment

  • Whether the aligner allow deletions and insertions
  • The CIGAR string in bam file contains I and D operation
  • Most aligners support gapped alignments, but note bowtie1 only support ungapped, so its output cannot be used for indel calling

DNA-seq alignment / RNA-seq alignment

  • The major difference between DNA-seq mapping and RNA-seq mapping is RNA-seq aligner perform spliced alignment
  • For spliced aware aligner, the output bam file contains N in cigar operation
  • Spliced-aware aligner usually accept genome annotation to define known splice junction
  • The mapping quality calculation in DNA-seq aligner and RNA-seq aligner can be different, as RNA-seq is seldom used for variant calling

Local alignment / End to end alignment

  • The scoring is different
  • End to end alignment requires both end of the reads to align to the genome, while local alignment only requires a substring of reads to align to genome
  • Local alignment is more sensitive, but may lead to false hit
  • Local alignment produce so called soft-clipped segments if ends of the reads is not aligned, that is S cigar operation
  • If you don’t trim adaptor, you have to use local alignment, but for short reads (small RNA sequencing or riboseq for example) it can be dangerous

Multimapper

Chimeric alignment

  • Split reads
  • Supplementary alignment
    • Note supplementary alignment and secondary alignment are different things

Unmapped reads

Paired end reads specific situation

Some parameter specification for reference

  • ENCODE RNA-seq
#https://github.com/ENCODE-DCC/long-rna-seq-pipeline/blob/master/dnanexus/rampage/rampage-align-pe/resources/usr/bin/rampage_align_star.sh
STAR --genomeDir out --readFilesIn $read1_fq_gz $read2_fq_gz                         \
    --readFilesCommand zcat --runThreadN $ncpus --genomeLoad NoSharedMemory           \
    --outFilterMultimapNmax 500 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1        \
    --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04                   \
    --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000              \
    --outSAMheaderCommentFile COfile.txt --outSAMheaderHD @HD VN:1.4 SO:coordinate        \
    --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD      \
    --outFilterScoreMinOverLread 0.85 --outFilterIntronMotifs RemoveNoncanonicalUnannotated \
    --clip5pNbases 6 15 --seedSearchStartLmax 30 --outSAMtype BAM SortedByCoordinate         \
    --limitBAMsortRAM ${ram_GB}000000000
  • TCGA RNA-seq
# https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
### Step 1: Building the STAR index.*

STAR
--runMode genomeGenerate
--genomeDir <star_index_path>
--genomeFastaFiles <reference>
--sjdbOverhang 100
--sjdbGTFfile <gencode.v22.annotation.gtf>
--runThreadN 8

### Step 2: Alignment 1st Pass.

STAR
--genomeDir <star_index_path>
--readFilesIn <fastq_left_1>,<fastq_left2>,... <fastq_right_1>,<fastq_right_2>,...
--runThreadN <runThreadN>
--outFilterMultimapScoreRange 1
--outFilterMultimapNmax 20
--outFilterMismatchNmax 10
--alignIntronMax 500000
--alignMatesGapMax 1000000
--sjdbScore 2
--alignSJDBoverhangMin 1
--genomeLoad NoSharedMemory
--readFilesCommand <bzcat|cat|zcat>
--outFilterMatchNminOverLread 0.33
--outFilterScoreMinOverLread 0.33
--sjdbOverhang 100
--outSAMstrandField intronMotif
--outSAMtype None
--outSAMmode None

### Step 3: Intermediate Index Generation.

STAR
--runMode genomeGenerate
--genomeDir <output_path>
--genomeFastaFiles <reference>
--sjdbOverhang 100
--runThreadN <runThreadN>
--sjdbFileChrStartEnd <SJ.out.tab from previous step>

### Step 4: Alignment 2nd Pass.

STAR
--genomeDir <output_path from previous step>
--readFilesIn <fastq_left_1>,<fastq_left2>,... <fastq_right_1>,<fastq_right_2>,...
--runThreadN <runThreadN>
--outFilterMultimapScoreRange 1
--outFilterMultimapNmax 20
--outFilterMismatchNmax 10
--alignIntronMax 500000
--alignMatesGapMax 1000000
--sjdbScore 2
--alignSJDBoverhangMin 1
--genomeLoad NoSharedMemory
--limitBAMsortRAM 0
--readFilesCommand <bzcat|cat|zcat>
--outFilterMatchNminOverLread 0.33
--outFilterScoreMinOverLread 0.33
--sjdbOverhang 100
--outSAMstrandField intronMotif
--outSAMattributes NH HI NM MD AS XS
--outSAMunmapped Within
--outSAMtype BAM SortedByCoordinate
--outSAMheaderHD @HD VN:1.4
--outSAMattrRGline <formatted RG line provided by wrapper>
  • TCGA DNA-seq
# https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/
bwa mem -t 8 -T 0 -R <read_group> <reference> <fastq_1.fq.gz> <fastq_2.fq.gz> | samtools view -Shb -o <output.bam> -
#We ran STAR version 2.5.3a with the following parameters to align RNA-seq reads: 
STAR --outFilterMultimapNmax 1 --outFilterMismatchNmax 3 --outFilterMismatchNoverLmax 0.3 --alignIntronMax 500000 \
--alignMatesGapMax 500000 --chimSegmentMin 10 --chimJunctionOverhangMin 10 --chimScoreMin 1 --chimScoreDropMax 30 \
--chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 --alignSJstitchMismatchNmax 5 -1 5 5 --chimSegmentReadGapMax 3 \
--chimMainSegmentMultNmax 10
  • STAR mapping for TE quantification in TEtranscript (handling multiple mapped reads)
    • https://github.com/mhammell-laboratory/TEtranscripts/issues/69
    • --outFilterMultimapNmax 100 and --outAnchorMultimapNmax 100 (alias –winAnchorMultimapNmax)
    • “we highly suggest users test multiple values of –winAnchorMultimapNmax to identify the optimal value for their experiment”

Alignment formats and APIs

Manipulate bam file

  • Format specification of bam file
  • Flag in bam file
  • Unmapped reads

Duplication handling