On Masking of Low Complexity Sequence
Handle low complexity sequence
- masking low complexity regions in a long sequence is sometimes favorable for database searching and motif finding, see http://web.mit.edu/meme_v4.11.4/share/doc/glam2_tut.html
- filter reads with low sequence complexity from NGS data
Method
- entropy
- The SEG algorithm
- entropy of 3-mers in a sliding window
- DUST score of 3-mers in a sliding window
- perform local alignment of the sequence against itself and against known repetitive elements
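A minimal sketch of the sliding-window 3-mer entropy idea from the list above, assuming Shannon entropy (in bits) over the frequencies of overlapping 3-mers in each window. The function names, the window size, and the step are illustrative assumptions and are not taken from any of the tools mentioned in these notes.

```python
import math
from collections import Counter


def kmer_entropy(seq, k=3):
    """Shannon entropy (bits) of the overlapping k-mer frequencies in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        # Sequence shorter than k: no k-mers, treat as zero entropy.
        return 0.0
    entropy = 0.0
    for c in counts.values():
        p = c / total
        entropy -= p * math.log2(p)
    return entropy


def sliding_scores(seq, window=64, step=1, score=kmer_entropy):
    """Yield (start, score) for each window; a short seq gives one window."""
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        yield start, score(seq[start:start + window])


if __name__ == "__main__":
    # Hypothetical sequence with a poly-A stretch in the middle.
    demo = "ACGTTGCAGTCATGCAAGCT" + "A" * 30 + "TGCATCGATCGGATCCTAGC"
    for start, h in sliding_scores(demo, window=20, step=10):
        print(start, round(h, 2))
```

Windows with low entropy are candidates for masking. The cutoff depends on the window size and the data, so it has to be tuned rather than copied from here. The `score` argument is a parameter so that a different per-window statistic can be substituted.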
Tools
- RepeatMasker, screens DNA sequences for interspersed repeats and low complexity DNA sequences
- NCBI’s BLAST ships with multiple masking tools, see https://www.ncbi.nlm.nih.gov/books/NBK569845/ for details
- segmasker
- dustmasker
- windowmasker
# example usage of dustmasker
dustmasker -outfmt fasta -parse_seqids -in {input.fasta} -out {masked.fasta}
# low complexity sequence is set to lower case (soft-masked); if you want to set it to N (hard-masked), run
sed '/^>/! s/[[:lower:]]/N/g' {masked.fasta} > {hardmasked.fasta} # see https://www.biostars.org/p/13677/s
- dust, a program shipped with the MEME Suite
dust sequences.fasta {cutoff} > sequences.masked.fasta
- NCBI’s seg, segments sequences by local complexity. A very old tool …
- sequence complexity (entropy and DUST) based read filter in PRINSEQ, also see https://chipster.csc.fi/manual/prinseq-complexity-filter.html
- I wrote a small script that assigns a single 3-mer count based DUST or entropy score to each read in a FASTQ file, see https://github.com/uaauaguga/NGS-Analysis-Notes/blob/master/scripts/sequence-complexity.py
- 2011, NAR, A new repeat-masking method enables specific detection of homologous sequences
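The idea of assigning one 3-mer count based DUST score per read, as in the script item above, can be sketched roughly as follows. This is not the linked script. It assumes one common DUST formulation (sum of c(c-1)/2 over 3-mer counts c, divided by the number of 3-mers minus one), unwrapped 4-line FASTQ records, and an arbitrary illustrative cutoff `max_dust=2.0`. All function names are hypothetical.

```python
import io
from collections import Counter


def dust_score(seq, k=3):
    """DUST-style score: sum of c*(c-1)/2 over k-mer counts / (n_kmers - 1)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    n = sum(counts.values())
    if n < 2:
        # Too short to score; treat as not low-complexity.
        return 0.0
    return sum(c * (c - 1) / 2 for c in counts.values()) / (n - 1)


def read_fastq(handle):
    """Yield (name, sequence, quality); assumes 4 lines per record."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or trailing blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:].strip(), seq, qual


def filter_reads(handle, max_dust=2.0):
    """Yield only reads whose DUST score is at most max_dust (assumed cutoff)."""
    for name, seq, qual in read_fastq(handle):
        if dust_score(seq) <= max_dust:
            yield name, seq, qual


if __name__ == "__main__":
    # Two made-up reads: one mixed sequence, one homopolymer.
    demo = io.StringIO(
        "@r1\nACGTTGCAGTCATGCA\n+\nIIIIIIIIIIIIIIII\n"
        "@r2\nAAAAAAAAAAAAAAAA\n+\nIIIIIIIIIIIIIIII\n"
    )
    for name, seq, _ in filter_reads(demo):
        print(name, seq)
```

A higher score means lower complexity, so the homopolymer read is the one dropped here. Tools such as PRINSEQ rescale their DUST score, so cutoffs are not interchangeable between implementations.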