Match regex with overlap for DNA

I am trying to match DNA sequences that start at the beginning of the string or at a multiple of 3 letters from the beginning, begin with either ATG or CGA, are followed by 6, 9, 12, 15, … letters, and end in AGT. The following code returns only one of the matches (the longest one). I have looked into positive lookaheads (e.g. (?=...)) but could not figure out how to apply them to this situation.

library(stringr)
dna <- "ABCATGABCGAAADFAGTAAAAGTAGTAAAGT"
str_match_all(dna, "^(...)*((?:ATG|CGA)(?:...){2,}(?:AGT))")

[[1]]
     [,1]                          [,2]  [,3]      
[1,] "ABCATGABCGAAADFAGTAAAAGTAGT" "ABC" "ATGABCGAAADFAGTAAAAGTAGT"

Desired (full match, prefix, sequence):
ABCATGABCGAAADFAGT ABC ATGABCGAAADFAGT
ABCATGABCGAAADFAGTAAAAGT ABC ATGABCGAAADFAGTAAAAGT
ABCATGABCGAAADFAGTAAAAGTAGT ABC ATGABCGAAADFAGTAAAAGTAGT
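
A single regex pass returns at most one match per starting position, so three matches that all begin at the same ATG cannot come out of one str_match_all call. One possible workaround, sketched here rather than offered as the definitive answer, is to split the string into in-frame triplets and enumerate every valid (start, end) pair. The names codons, starts, ends, pairs and result are illustrative, and the prefix column is assumed to mean everything before the start codon, whereas the original capture group keeps only the last triplet before it.

```r
library(stringr)

dna <- "ABCATGABCGAAADFAGTAAAAGTAGTAAAGT"

# Split into in-frame triplets; a trailing partial triplet is ignored.
n <- nchar(dna) %/% 3
codons <- str_sub(dna,
                  seq(1, by = 3, length.out = n),
                  seq(3, by = 3, length.out = n))

# Triplet indices that may open or close a match.
starts <- which(codons %in% c("ATG", "CGA"))
ends   <- which(codons == "AGT")

# Keep every (start, end) pair with at least two triplets in between,
# mirroring (?:...){2,} in the original pattern.
pairs <- expand.grid(start = starts, end = ends)
pairs <- pairs[pairs$end - pairs$start >= 3, ]

result <- data.frame(
  full   = str_sub(dna, 1, 3 * pairs$end),                   # from string start to end of match
  prefix = str_sub(dna, 1, 3 * (pairs$start - 1)),           # assumption: whole prefix, not last triplet
  match  = str_sub(dna, 3 * pairs$start - 2, 3 * pairs$end), # start codon through AGT
  stringsAsFactors = FALSE
)
print(result)
```

For the example string this yields three rows, with match values ATGABCGAAADFAGT, ATGABCGAAADFAGTAAAAGT and ATGABCGAAADFAGTAAAAGTAGT, each with prefix ABC. The trailing AAAGT is skipped because its AGT is not aligned to a triplet boundary.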
