Regex comment matching


COM
Member
Registered: 11.09.11 00:50
Timezone: UTC +1
Posts: 10

Not too long ago I ran into a situation where I would've liked to strip the comments and strings out of a file. Actually, it wasn't just one file, it was a shitload of them as part of an automated process.

This isn't the first time I've had to strip out comments and/or strings, so I figured it would be nice to solve it with regular expressions instead of a lot of manual if-checking and such: just a nice, concise, easily reusable regex.

Unfortunately, the suggestions I found while googling were inadequate, not taking into account things like comments in strings, strings in comments, escaping of closing quotes, etc.
So I wrote my own.

The only requirements for the regexes here are that the code is matched as one long string rather than in multiline mode (the patterns also count on . not matching line breaks), and that the regex engine used is up to date enough to handle magic like \K.

For C++/Java style code (should also handle C):

(?:(?:(["'])(?:.|[\r\n])*?(?<!\\)(?:\\\\)*\1)\K|(?://.*)|(?:/\*(?:.|[\r\n])*?\*/))
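If your engine has no \K (Python's built-in re module is one of those), the same idea can be emulated with a replacement callback. Below is a minimal sketch of that, not a drop-in for the pattern above: the string branch sits in a named group that the callback writes back unchanged, and anything else that matched is a comment and gets dropped. The names C_STYLE and strip_comments are made up for the example.

```python
import re

# Same three branches as the C-style pattern: string literal, // comment,
# /* */ comment. The string branch is wrapped in a named group because
# Python's re module has no \K to discard it from the match.
C_STYLE = re.compile(
    r"""(?P<string>(?P<quote>["'])(?:.|[\r\n])*?(?<!\\)(?:\\\\)*(?P=quote))"""
    r"""|//.*"""
    r"""|/\*(?:.|[\r\n])*?\*/"""
)


def strip_comments(source: str) -> str:
    """Remove // and /* */ comments while leaving string literals untouched."""
    def keep_strings(match: re.Match) -> str:
        # If the string branch matched, emit it unchanged; otherwise the
        # match was a comment and is replaced with nothing.
        return match.group("string") or ""
    return C_STYLE.sub(keep_strings, source)


sample = 'puts("a // b"); // trailing\n/* block "x" */ int c = \'"\';\n'
print(strip_comments(sample))
```

On the sample, the "a // b" literal and the '"' character literal should survive, while the trailing comment and the block comment disappear. Whitespace around a removed comment is left where it was.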

PHP is a mean bastard of a language and will probably require two passes, since a file can switch PHP on or off at will.
The regex is only reliable if the code you're matching against is entirely PHP (and not mixed with anything outside the PHP tags):

(?:(?:(["'])(?:.|[\r\n])*?(?<!\\)(?:\\\\)*\1)\K|(?:<<<(?:('?)([\w\d]+)\2))(?:.|[\r\n])*?(?<=[\r\n])\3(?=;?[\r\n])\K|(?:(?://|#).*?(?=(?:[\r\n]|\?>|$)))|(?:/\*(?:.|[\r\n])*?\*/))
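To show what the extra PHP branches are for, here is a rough Python sketch of the same pattern, again using a replacement callback because Python's re module has no \K. Strings and heredoc/nowdoc blocks land in a named group that is written back as-is; # and // comments stop at a line break, at ?> or at the end of input. PHP_STYLE and strip_php_comments are hypothetical names, and the heredoc body is matched lazily here, which is an assumption of this sketch so that two heredocs sharing a label stay separate.

```python
import re

PHP_STYLE = re.compile(
    # Kept as-is: quoted strings ...
    r"""(?P<keep>(?P<quote>["'])(?:.|[\r\n])*?(?<!\\)(?:\\\\)*(?P=quote)"""
    # ... and heredoc/nowdoc blocks, closed by the label at the start of a line.
    r"""|<<<(?P<nq>'?)(?P<label>\w+)(?P=nq)(?:.|[\r\n])*?(?<=[\r\n])(?P=label)(?=;?[\r\n]))"""
    # Dropped: line comments, which end at a line break, at ?> or at end of input.
    r"""|(?://|#).*?(?=[\r\n]|\?>|$)"""
    # Dropped: block comments.
    r"""|/\*(?:.|[\r\n])*?\*/"""
)


def strip_php_comments(source: str) -> str:
    """Remove PHP comments; strings and heredocs are written back unchanged."""
    return PHP_STYLE.sub(lambda m: m.group("keep") or "", source)


php = '$a = <<<EOT\n# not a comment\nEOT;\n$b = "x # y"; # real comment\n// bye ?>\n'
print(strip_php_comments(php))
```

The # inside the heredoc and the one inside the quoted string should be left alone, and the closing ?> should survive because the line comment stops right before it.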

These regexes should match any comments (and strings, with some very minor modification) and will allow easy removal of them with a regex replace.

Remember that you will probably have to escape these regexes again if you're embedding them in a string literal in, for instance, PHP or a similar language.
As mentioned, they are currently set to leave strings alone and only match comments, but that is easily modified (just remove whichever \K you're particularly not fond of).
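On the escaping point, here's a small illustration with Python standing in for whatever host language you use (the variable names are just for the example). It spells the string-matching branch of the pattern twice: once in a raw literal, where it can be pasted as posted, and once in an ordinary literal, where every backslash has to be doubled.

```python
import re

# String-literal branch of the pattern, exactly as posted (raw literal).
AS_POSTED = r"""(["'])(?:.|[\r\n])*?(?<!\\)(?:\\\\)*\1"""

# Same pattern in an ordinary literal: every backslash is doubled, and the
# double quote is escaped because it is also the literal's delimiter.
ESCAPED_AGAIN = "([\"'])(?:.|[\\r\\n])*?(?<!\\\\)(?:\\\\\\\\)*\\1"

# Both spellings produce the identical pattern text.
assert AS_POSTED == ESCAPED_AGAIN

# Quick check that the pattern accepts a string containing an escaped quote.
print(re.fullmatch(ESCAPED_AGAIN, r'"a \" b"') is not None)
```

If your language has raw or single-quoted literals, using them keeps the pattern much closer to what's written in this post.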

Feel free to add regexes for other languages or tell me if I've missed something; things may have slipped by my testing after all.


Senzon
Member
Registered: 25.05.12 19:13
Timezone: UTC +3
Posts: 17

Nice, looks like a pretty useful regular expression. Thanks for sharing! :)

