On Masking of Low Complexity Sequence

Handle low complexity sequence

masking low-complexity regions in a long sequence is sometimes helpful for database searching and motif finding, see http://web.mit.edu/meme_v4.11.4/share/doc/glam2_tut.html
filtering reads with low sequence complexity out of NGS data

Method

entropy
- The SEG algorithm
- entropy of 3-mers in a sliding window
dust score of 3-mers in a sliding window
- see https://kodomo.fbb.msu.ru/FBB/year_09/ppt/DUST.pdf
perform local alignment of the sequence against itself and against known repetitive elements
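
The two window scores listed above can be sketched in Python as follows. This is a minimal illustration and not the implementation used by SEG or dustmasker: the function names (`kmer_counts`, `entropy`, `dust`, `sliding_scores`) and the default window of 64 are assumptions chosen for the example. The dust score here follows one common formulation: the sum of c(c-1)/2 over all 3-mer counts c, divided by the number of 3-mers minus one.

```python
from collections import Counter
from math import log2


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in seq (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def entropy(seq, k=3):
    """Shannon entropy (bits) of the k-mer distribution; low means repetitive."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    return sum(c / total * log2(total / c) for c in counts.values())


def dust(seq, k=3):
    """Dust-style score (assumed formulation); high means repetitive."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    if total < 2:
        return 0.0
    return sum(c * (c - 1) / 2 for c in counts.values()) / (total - 1)


def sliding_scores(seq, score, window=64, step=1):
    """Return (start, score) for each window; window and step are assumptions."""
    last_start = max(len(seq) - window, 0)
    return [(start, score(seq[start:start + window]))
            for start in range(0, last_start + 1, step)]
```

A window with low entropy or a high dust score is a candidate for masking; the cutoff is left to the caller.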

Tools

RepeatMasker, which screens DNA sequences for interspersed repeats and low-complexity DNA sequences
NCBI’s BLAST ships with multiple masking tools, see https://www.ncbi.nlm.nih.gov/books/NBK569845/ for details
- segmasker
- dustmasker
- windowmasker

# example usage of dustmasker
dustmasker -outfmt fasta -parse_seqids -in {input.fasta} -out {masked.fasta}
# low-complexity sequence is set to lower case; to set it to N instead, run
sed '/^>/! s/[[:lower:]]/N/g' {masked.fasta} > {hardmasked.fasta} # see https://www.biostars.org/p/13677/s
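
The same hard-masking step can also be done in Python, for example when sed is not available. This is a minimal sketch, assuming the input is a soft-masked FASTA file like the one dustmasker writes above; the function name `hardmask_fasta_lines` is hypothetical.

```python
def hardmask_fasta_lines(lines, mask_char="N"):
    """Replace lower-case residues with mask_char, leaving header lines untouched."""
    for line in lines:
        if line.startswith(">"):
            # Header line: keep as-is, even if it contains lower-case letters.
            yield line
        else:
            yield "".join(mask_char if ch.islower() else ch for ch in line)


# Small in-memory example; for a real file, pass an open file handle instead.
example = [">seq1 some description\n", "ACGTacgtACGT\n"]
masked = list(hardmask_fasta_lines(example))
print("".join(masked), end="")
```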

dust, a program shipped with the MEME Suite

dust sequences.fasta {cutoff} > sequences.masked.fasta

NCBI’s seg, which segments sequences by local complexity. A very old tool …
a read filter based on sequence complexity (entropy and dust) in prinseq, also see https://chipster.csc.fi/manual/prinseq-complexity-filter.html
I wrote a small script that assigns a single 3-mer count based dust or entropy score to each read in a fastq file, see https://github.com/uaauaguga/NGS-Analysis-Notes/blob/master/scripts/sequence-complexity.py
2011, NAR, A new repeat-masking method enables specific detection of homologous sequences
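
As an illustration of the per-read scoring idea mentioned above, the sketch below assigns one 3-mer entropy score to each read in a fastq stream. It is not the linked sequence-complexity.py script: the function names and the strict four-lines-per-record fastq layout are assumptions made for this example.

```python
from collections import Counter
from io import StringIO
from math import log2


def read_fastq(handle):
    """Yield (name, sequence) pairs, assuming exactly four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line, unused here
        yield header.strip()[1:], seq


def kmer_entropy(seq, k=3):
    """Shannon entropy (bits) of the k-mer counts over the whole read."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return sum((c / total * log2(total / c) for c in counts.values()), 0.0)


def score_reads(handle, k=3):
    """Yield (read name, entropy score) for every read in the stream."""
    for name, seq in read_fastq(handle):
        yield name, kmer_entropy(seq.upper(), k)


# In-memory example with one homopolymer read and one more varied read.
example = StringIO("@r1\nAAAAAAAAAA\n+\nIIIIIIIIII\n"
                   "@r2\nACGTACGTAC\n+\nIIIIIIIIII\n")
scores = dict(score_reads(example))
```

Reads whose score falls below a chosen cutoff can then be dropped; the cutoff depends on read length and is left to the user.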