Correcting character offsets in text with extra whitespace, newlines, etc.

I am trying to implement a simple way to correct character offsets that shift because of ‘junk’ characters. For example:

string_1 = "London is the capital of the UK"

-> character locations: ("capital", 14, 21) and ("UK", 29, 31)
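For reference, these locations follow Python's slicing convention (start inclusive, end exclusive). A minimal sketch of how they can be computed, where `find_spans` is a hypothetical helper name and not part of any library:

```python
import re

def find_spans(text, keyword):
    """Return (keyword, start, end) for each occurrence; end is exclusive."""
    return [(keyword, m.start(), m.end())
            for m in re.finditer(re.escape(keyword), text)]

string_1 = "London is the capital of the UK"
print(find_spans(string_1, "capital"))  # [('capital', 14, 21)]
print(find_spans(string_1, "UK"))       # [('UK', 29, 31)]
```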

However, in the presence of newlines etc., the character locations change:

string_2 = "London is the\n\ncapital of the\n UK"

-> character locations: ("capital", 15, 22) and ("UK", 31, 33)

Moreover, the string may contain any number of newlines and other artefacts, as well as any number of keywords.

My question is: given a text with extra characters (ASCII, newlines, etc.) and some cleaning function, how do I translate the locations of certain keywords between the cleaned text and the original?

string_1 = cleaning_txt(string_2)

-> ("capital", 14, 21) --> ("capital", 15, 22)
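One possible approach is to have the cleaning step record, for every character it keeps, where that character came from. The sketch below assumes that cleaning only collapses each run of whitespace into a single space. `clean_with_offsets`, `to_raw_span` and `to_clean_span` are hypothetical names standing in for the real `cleaning_txt`, whose behaviour is not shown in the question. Spans are assumed to be non-empty and end-exclusive.

```python
import re
from bisect import bisect_right

def clean_with_offsets(raw):
    """Collapse each whitespace run to one space (assumed cleaning rule).

    Returns the cleaned text plus index_map, where index_map[i] is the
    position in raw of the i-th character of the cleaned text.
    """
    cleaned, index_map = [], []
    # Each match is either a whole whitespace run or one other character.
    for m in re.finditer(r"\s+|\S", raw):
        cleaned.append(" " if m.group().isspace() else m.group())
        index_map.append(m.start())
    return "".join(cleaned), index_map

def to_raw_span(span, index_map):
    """Map a (start, end) span in the cleaned text to the raw text."""
    start, end = span
    return index_map[start], index_map[end - 1] + 1

def to_clean_span(span, index_map):
    """Map a (start, end) span in the raw text to the cleaned text."""
    start, end = span
    # Last cleaned character whose raw position is <= the given position.
    first = bisect_right(index_map, start) - 1
    last = bisect_right(index_map, end - 1) - 1
    return first, last + 1

string_2 = "London is the\n\ncapital of the\n UK"
cleaned, index_map = clean_with_offsets(string_2)

print(cleaned)                              # London is the capital of the UK
print(to_raw_span((14, 21), index_map))     # (15, 22)
print(to_raw_span((29, 31), index_map))     # (31, 33)
print(to_clean_span((15, 22), index_map))   # (14, 21)
```

Python counts each `\n` as a single character, which is why the spans for the `string_2` literal as written come out as (15, 22) and (31, 33). If the cleaning function is a black box that cannot be modified to record positions, aligning its output with the input afterwards (for example with `difflib.SequenceMatcher(None, cleaned, raw).get_matching_blocks()`) may be a workable alternative, though it is only an approximation when the text contains repeated substrings.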
