• We may want to filter unwanted sequences out of the fastq files prior to downstream analysis
    • rRNA for RNA-seq data
    • host reads for metagenomic data

Remove rRNA sequence

  • For single-species RNA-seq, simply map the reads to the rRNA reference of that species and keep the unaligned reads
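
A minimal sketch of the map-and-keep-unaligned approach. The notes do not name an aligner, so bowtie2 is an assumption here, and the index prefix, function names, and output file names are hypothetical.

```shell
# Assumption: bowtie2 is the aligner; any short-read mapper that can
# write out unaligned reads would serve the same purpose.
# Hypothetical location for the bowtie2 index of the rRNA reference.
rrna_index=reference/rRNA/bowtie2-index/rRNA

# Build the bowtie2 index from the rRNA fasta (needed only once).
build_rrna_index() {
    mkdir -p "$(dirname "$rrna_index")"
    bowtie2-build reference/rRNA/fasta/rRNA.fasta "$rrna_index"
}

# Usage: remove_rrna <fastq_1> <fastq_2> <out_dir> [threads]
remove_rrna() {
    fastq_1=$1
    fastq_2=$2
    out_dir=$3
    threads=${4:-4}
    mkdir -p "$out_dir"
    # --un-conc-gz writes the pairs that did NOT align concordantly to the
    # rRNA reference; bowtie2 replaces "%" with 1 and 2 for the two mates.
    # The alignments themselves are not needed, so they go to /dev/null.
    bowtie2 -p "$threads" -x "$rrna_index" \
        -1 "$fastq_1" -2 "$fastq_2" \
        --un-conc-gz "$out_dir/rRNA_free_R%.fastq.gz" \
        -S /dev/null
}
```

For example, `remove_rrna sample_1.fastq.gz sample_2.fastq.gz output/rRNA_free 8` would leave the retained pairs in `output/rRNA_free`. Note that `--un-conc-gz` also keeps pairs in which only one mate hit rRNA; a stricter filter would need additional post-processing.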

  • For metagenomic data, you can try sortmerna

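# download the rRNA reference from the sortmerna GitHub repo: https://github.com/biocore/sortmerna/tree/master/data/rRNA_databases
# build the sortmerna index
# --index 1: build the index and exit
sortmerna --ref reference/rRNA/fasta/rRNA.fasta --index 1 --idx-dir reference/rRNA/sortmerna-index

# run sortmerna ({input.fastq_1}, {working_dir}, etc. are placeholders to fill in)
# --index 0: skip indexing and match reads against the existing index
# --fastx: write reads in FASTA/FASTQ format
# --other: also write the reads that did not match rRNA (the ones to keep)
# --out2: write paired reads to two separate files
sortmerna --ref reference/rRNA/fasta/rRNA.fasta --idx-dir reference/rRNA/sortmerna-index --reads {input.fastq_1} --reads {input.fastq_2} --fastx --workdir {working_dir} --index 0 --out2 --other --threads {threads}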

  • To annotate rRNA sequences in the assembled contigs of a bacterial genome, try barrnap
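
A minimal sketch of a barrnap run on assembled contigs. The function name and file names are hypothetical; `--kingdom bac` selects barrnap's bacterial rRNA models, and the predictions are written to standard output in GFF3 format.

```shell
# Usage: annotate_rrna <contigs.fasta> <out.gff3> [threads]
# Hypothetical wrapper; the input and output paths are placeholders.
annotate_rrna() {
    contigs=$1
    gff_out=$2
    threads=${3:-4}
    # barrnap prints GFF3 to stdout, so redirect it to the output file.
    barrnap --kingdom bac --threads "$threads" "$contigs" > "$gff_out"
}
```

For example, `annotate_rrna assembly/contigs.fasta annotation/rRNA.gff3 8` would write one GFF3 record per predicted rRNA gene.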

Remove host genome sequence