Split Sequencing Reads
- We may want to filter unwanted sequences out of FASTQ files prior to downstream analysis, for example:
  - rRNA reads in RNA-seq data
  - host reads in metagenomic data
Remove rRNA sequence
- For single-species RNA-seq, simply map reads to the rRNA reference of that species and keep the unaligned reads
- For metagenomic data, try sortmerna
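# download the rRNA reference from the sortmerna GitHub repo: https://github.com/biocore/sortmerna/tree/master/data/rRNA_databases
# build the sortmerna index
sortmerna --ref reference/rRNA/fasta/rRNA.fasta --index 1 --idx-dir reference/rRNA/sortmerna-index
# --index 1: build the index and exit
# run sortmerna ({input.fastq_1}, {working_dir}, {threads} etc. are placeholders to fill in)
sortmerna --ref reference/rRNA/fasta/rRNA.fasta --idx-dir reference/rRNA/sortmerna-index \
  --reads {input.fastq_1} --reads {input.fastq_2} \
  --workdir {working_dir} --index 0 \
  --fastx --other --out2 --threads {threads}
# --index 0: skip indexing and match reads against the existing index
# --fastx: write reads in FASTA/FASTQ format
# --other: also write the reads that did not match rRNA (the ones to keep)
# --out2: write paired reads to separate forward and reverse files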
- To annotate rRNA sequences in assembled contigs of bacterial genomes, try barrnap
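For the single-species case, the "map reads and keep the unaligned ones" step could look roughly like the sketch below. It assumes bowtie2 as the aligner (these notes do not name one) and paired-end input; the function name `filter_rrna_pairs`, the index prefix and all file paths are hypothetical placeholders.

```shell
# Sketch: drop rRNA reads from paired-end RNA-seq data by mapping to an rRNA
# reference and keeping only the pairs that do not align concordantly.
# Assumption: bowtie2 is the aligner; all names and paths are placeholders.

# Build the index once beforehand (hypothetical paths):
#   bowtie2-build reference/rRNA/fasta/rRNA.fasta reference/rRNA/bowtie2-index/rRNA

filter_rrna_pairs() {
    local index_prefix="$1"   # bowtie2 index prefix of the rRNA reference
    local fastq_1="$2"        # forward reads
    local fastq_2="$3"        # reverse reads
    local out_prefix="$4"     # outputs: <out_prefix>_1.fastq.gz and <out_prefix>_2.fastq.gz
    local threads="${5:-4}"   # number of threads, defaults to 4

    # --un-conc-gz writes the pairs that failed to align concordantly, gzipped;
    # the "%" in the path is replaced by 1 or 2 for the two mates.
    # The alignments themselves are not needed, so the SAM output is discarded.
    bowtie2 -x "$index_prefix" \
        -1 "$fastq_1" -2 "$fastq_2" \
        --un-conc-gz "${out_prefix}_%.fastq.gz" \
        -p "$threads" \
        -S /dev/null
}
```

Example call: `filter_rrna_pairs reference/rRNA/bowtie2-index/rRNA sample_1.fastq.gz sample_2.fastq.gz output/sample.clean 8`. Note that `--un-conc-gz` also keeps pairs where only one mate aligned or the pair aligned discordantly, so this is the lenient variant; a stricter filter would keep only pairs in which neither mate aligned.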
Remove host genome sequence
- Align reads to the host genome and keep the unaligned reads
- bmtagger
  - Used in the HMP project, see https://www.hmpdacc.org/hmp/doc/HumanSequenceRemoval_SOP.pdf
- bbduk.sh in the BBMap suite
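As an illustration of the bbduk.sh option, here is a minimal sketch of k-mer based host read removal for paired-end data. The function name `remove_host_bbduk`, the `BBDUK` override variable, the output naming and the choice of `k=31` are assumptions made for this example, not settings taken from these notes.

```shell
# Sketch: k-mer based host read removal with bbduk.sh.
# Assumptions: paired-end input; host genome given as a FASTA file;
# all names and paths are placeholders.

remove_host_bbduk() {
    local host_fasta="$1"     # host reference (FASTA)
    local fastq_1="$2"        # forward reads
    local fastq_2="$3"        # reverse reads
    local out_prefix="$4"     # prefix for all output files
    local threads="${5:-4}"   # number of threads, defaults to 4
    # BBDUK is a hypothetical override for installs where bbduk.sh is not on PATH.
    local bbduk="${BBDUK:-bbduk.sh}"

    # out/out2   : pairs with no k-mer match to the host (the reads to keep)
    # outm/outm2 : pairs that matched the host (written only for inspection)
    # k=31       : k-mer length used for matching (example value)
    "$bbduk" in="$fastq_1" in2="$fastq_2" \
        ref="$host_fasta" k=31 \
        out="${out_prefix}_1.fastq.gz" out2="${out_prefix}_2.fastq.gz" \
        outm="${out_prefix}.host_1.fastq.gz" outm2="${out_prefix}.host_2.fastq.gz" \
        threads="$threads"
}
```

Example call: `remove_host_bbduk reference/host/genome.fasta sample_1.fastq.gz sample_2.fastq.gz output/sample.host_free 8`. Holding the k-mers of a large host genome in memory may need a lot of RAM, so for such hosts the alignment-based route may be the more practical one.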