sed: extract text within a pattern that occurs an arbitrary number of times in a line

I need to extract a part of a string that may appear 1 to n times in each line.

For instance, this would reflect what I need:

This [dbo].[something] is a text containing [dbo].[something_else], then okay?
And then, [dbo].[something] may appear just once.
But why, nothing prevents [dbo].[something] from appearing twice as [dbo].[something] here.
And then can be three times, as [dbo].[something] is [dbo].[anything] but [dbo].[elsewhere] here.
[dbo].[otherthing] depicts another scenario with just one and pattern heading line
Or, also [dbo].[ultra] with an arbitrary amount of [dbo].[references] but ending with [dbo].[pattern]

As you may have noticed, the pattern would be \[dbo\]\.\[[^]]+\]. For instance, from the text above, I would want a result of:

something something_else
something
something something
something anything elsewhere
otherthing
ultra references pattern

Then I can just inline everything (or append to a bash array) and filter duplicates; this shouldn’t be an issue. I am just having trouble figuring out how to do this filtering in a single sweep.
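For the deduplication step, a minimal sketch, assuming the extracted names are already space-separated on each line as in the desired result above. The helper name `unique_refs` is purely illustrative, and `LC_ALL=C` is only there to make the sort order predictable:

```shell
# Hypothetical helper: read space-separated names on stdin,
# print each distinct name once, one per line.
unique_refs() {
  # Put every name on its own line, then sort and drop duplicates.
  tr ' ' '\n' | LC_ALL=C sort -u
}

# Example with two lines of already-extracted names:
printf '%s\n' 'something something_else' 'something' | unique_refs
```

The output could then be read into a bash array if needed.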

What I have so far extracts just the last match on each line (the reason is obvious once you are used to sed’s “greedy” approach to pattern matching):

cat dborefs.txt | sed -E "s/(.*\[dbo\]\.\[([^]]+)\].*)*/\2/g"
something_else
something
something
elsewhere
otherthing
pattern

I could extract, then replace the matched patterns so that they no longer match, and extract again until there are no more matches, but that sounds too cumbersome, all bash overhead considered. It would be best to extract everything in a single call to sed. I feel this should be possible; I just can’t easily figure out how. Thinking this may be useful for others, I felt that sharing the matter here could prove fruitful for the community.
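One possible direction for a single call, offered as a sketch rather than a definitive answer: wrap every match in newline markers, then delete whatever lies outside the markers. This assumes GNU sed, since it relies on `\n` meaning a newline both in the replacement and inside a bracket expression, which other sed implementations may not support. The function name `extract_dbo_refs` is illustrative only, and lines with no match are dropped.

```shell
# Hypothetical helper (assumes GNU sed): print the names found in
# [dbo].[name] references, space-separated, one output line per input line.
extract_dbo_refs() {
  sed -E '
    # Drop lines that contain no reference at all.
    /\[dbo\]\.\[[^]]+\]/!d
    # Wrap each captured name in newlines; a newline cannot occur in the
    # input line itself, so it is a safe marker.
    s/\[dbo\]\.\[([^]]+)\]/\n\1\n/g
    # Remove the text before the first name and after the last name.
    s/^[^\n]*\n//
    s/\n[^\n]*$//
    # Collapse the text between two names into a single space.
    s/\n[^\n]*\n/ /g
  '
}

# Example with one of the sample lines:
printf '%s\n' 'This [dbo].[something] is a text containing [dbo].[something_else], then okay?' \
  | extract_dbo_refs
# prints: something something_else
```

Run as `extract_dbo_refs < dborefs.txt`, this should give the six lines of the desired result listed earlier. Because each match is consumed by the global substitution before any greedy `.*` gets involved, no match is swallowed by its neighbours.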
