Reg(ular )?Ex(pressions)?

Advanced

Last time, we learned the minimum required to use regular expressions.

One thing I should have mentioned, in case you're still intimidated:

Remember the first time you looked at code? Did it look like a giant pile of text without any meaning?

Well, you can learn to read and write regexes just as you learned to code!

Replacing

Regular expressions are not only used for matching; replacing is an important feature too. Let's look at an example.

https://regex101.com/r/QyDVie/1

Rule of thumb: () creates a capturing group, and what it "captures" can be reused in a "replacing expression" by referencing it as $[group number], e.g. $1
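
As a minimal sketch of the rule of thumb above, here is a JavaScript replacement that swaps two captured words. The input string and variable names are made up for illustration.

```javascript
// Two capturing groups: (first word) and (second word).
const fullName = "Ada Lovelace";

// $1 and $2 in the replacement refer to what groups 1 and 2 captured.
const swapped = fullName.replace(/([a-zA-Z]+) ([a-zA-Z]+)/, "$2, $1");

console.log(swapped); // Lovelace, Ada
```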

Named grouping (?<name>.*)

https://regex101.com/r/vYosgh/1
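
A small sketch of named groups in JavaScript, assuming an ISO-style date as input (the date and the variable names are illustrative). In a JavaScript replacement string a named group is referenced as $<name>; Java uses ${name} instead.

```javascript
const isoDate = "2024-03-15";
const datePattern = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;

// Named groups are exposed on the match result under .groups
const dateMatch = isoDate.match(datePattern);
console.log(dateMatch.groups.year); // 2024

// ...and can be referenced by name in the replacement string.
const dayFirst = isoDate.replace(datePattern, "$<day>/$<month>/$<year>");
console.log(dayFirst); // 15/03/2024
```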

"Ignored" (non-capturing) grouping, for when you want to apply a quantifier to a group but not capture it: (?:...)
https://regex101.com/r/NNf0Ns/1
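
To make the difference concrete, here is a short JavaScript sketch comparing a non-capturing group with a capturing one. The sample string is hypothetical.

```javascript
const laugh = "hahaha!";

// (?:ha)+ repeats "ha" without creating a group, so (!) is group 1.
const withIgnoredGroup = laugh.match(/(?:ha)+(!)/);
console.log(withIgnoredGroup[0]); // hahaha!
console.log(withIgnoredGroup[1]); // !

// With a plain (ha)+ the repeated part takes group 1 and (!) moves to group 2.
const withCapturingGroup = laugh.match(/(ha)+(!)/);
console.log(withCapturingGroup[1]); // ha
console.log(withCapturingGroup[2]); // !
```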

Reuse a capture within the match itself (a backreference, e.g. \1) 🤯
https://regex101.com/r/MiW0fF/1
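
One possible illustration of reusing a capture while matching: requiring that a closing quote is the same character as the opening one. The pattern and inputs are made up for this sketch.

```javascript
// Group 1 captures the opening quote; \1 must then match that same character.
const quotedText = /(["'])[a-z ]*\1/;

const doubleQuoted = quotedText.test('"hello world"');
const singleQuoted = quotedText.test("'hello world'");
const mismatched = quotedText.test("\"hello world'");

console.log(doubleQuoted); // true
console.log(singleQuoted); // true
console.log(mismatched);   // false
```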

Character classes

Last time we learnt about the [] syntax, which lets you match any one of a set or range of characters.
There are some handy shortcuts for common ones:

\b matches a "word boundary": the position at the beginning or end of a word, similar to ^ and $ for lines

\s and \S match any whitespace character and any non-whitespace character respectively

\d and \D match digits and non-digits respectively

\w and \W match "word" and non-"word" characters respectively; \w is [0-9a-zA-Z_]
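
A quick JavaScript sketch of these shortcuts in action. The sample sentence is invented; note that \w includes the underscore.

```javascript
const sample = "Order 42: 3 cats_and_dogs";

const numbers = sample.match(/\d+/g);   // runs of digits
const words = sample.match(/\w+/g);     // runs of word characters
const chunks = sample.match(/\S+/g);    // runs of non-whitespace

console.log(numbers); // [ '42', '3' ]
console.log(words);   // [ 'Order', '42', '3', 'cats_and_dogs' ]
console.log(chunks);  // [ 'Order', '42:', '3', 'cats_and_dogs' ]

// \b only matches at a word boundary, so "cat" inside "cats" is rejected.
const wholeWordInCats = /\bcat\b/.test("cats");
const wholeWordAlone = /\bcat\b/.test("a cat!");
console.log(wholeWordInCats); // false
console.log(wholeWordAlone);  // true
```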

Greediness of quantifiers

Quantifiers are the *, +, ? and {} characters seen last session.
By default they are greedy: they try to match as many characters as possible.
Appending ? to a quantifier makes it non-greedy (or lazy): it matches as few characters as possible.

Greedy (default)

Lazy (append ?)
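
A minimal sketch of the two behaviours on the same input, using a made-up HTML snippet:

```javascript
const html = "<b>bold</b> and <i>italic</i>";

// Greedy: .+ grabs as much as it can, so the match runs to the LAST ">".
const greedyMatch = html.match(/<.+>/)[0];

// Lazy: .+? grabs as little as it can, so the match stops at the FIRST ">".
const lazyMatch = html.match(/<.+?>/)[0];

console.log(greedyMatch); // <b>bold</b> and <i>italic</i>
console.log(lazyMatch);   // <b>
```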

Laziness and greediness

And a somewhat practical use case:
https://regex101.com/r/jFGSyi/3

Let's see how each quantifier behaves with the lazy modifier:
https://regex101.com/r/LQ0tYW/1
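
The same comparison can be sketched in JavaScript by running each quantifier, greedy and lazy, against one string of digits. The helper firstMatch and the input are hypothetical, chosen only for this illustration.

```javascript
const digitRun = "12345";

// Hypothetical helper: text of the first match of a pattern in digitRun.
const firstMatch = (pattern) => digitRun.match(pattern)[0];

const lazyComparison = {
  star: firstMatch(/\d*/),           // "12345" (as many as possible)
  starLazy: firstMatch(/\d*?/),      // ""      (zero is enough)
  plus: firstMatch(/\d+/),           // "12345"
  plusLazy: firstMatch(/\d+?/),      // "1"     (one is the minimum)
  optional: firstMatch(/\d?/),       // "1"
  optionalLazy: firstMatch(/\d??/),  // ""      (prefers zero)
  range: firstMatch(/\d{2,4}/),      // "1234"
  rangeLazy: firstMatch(/\d{2,4}?/), // "12"    (stops at the lower bound)
};

console.log(lazyComparison);
```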

Lookahead/Lookbehind

Lookaround lets you match based on the surrounding context without including that context in the match. This is particularly useful when you want to use the matched text itself.

Positive lookahead (?=...) means "that is followed by"
https://regex101.com/r/fid3Dr/4

Negative lookahead (?!...) means "that is NOT followed by"

https://regex101.com/r/3WlsaA/1
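
A small JavaScript sketch of both kinds of lookahead, assuming a made-up price list. The \b anchors in the negative case are there on purpose: without them the engine could backtrack and match just the "1" of "10", since "1" is followed by "0" rather than " USD".

```javascript
const priceList = "10 USD, 20 EUR, 30 USD";

// Positive lookahead: numbers followed by " USD" (the " USD" is not part of the match).
const usdAmounts = priceList.match(/\d+(?= USD)/g);

// Negative lookahead: whole numbers NOT followed by " USD".
const otherAmounts = priceList.match(/\b\d+\b(?! USD)/g);

console.log(usdAmounts);   // [ '10', '30' ]
console.log(otherAmounts); // [ '20' ]
```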

Positive lookbehind (?<=...) means "that is preceded by"
https://regex101.com/r/VdAeWG/1

Negative lookbehind (?<!...) means "that is NOT preceded by"

https://regex101.com/r/LgI6GG/1
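
The lookbehind variants can be sketched the same way. This assumes a JavaScript runtime with lookbehind support (ES2018, e.g. a recent Node.js); the input string is invented. As with negative lookahead, \b keeps the engine from matching the tail of a number, such as the "0" in "$10".

```javascript
const invoiceLine = "price:$10 qty:20 tax:$3";

// Positive lookbehind: digits preceded by "$" (the "$" is not part of the match).
const dollarAmounts = invoiceLine.match(/(?<=\$)\d+/g);

// Negative lookbehind: numbers NOT preceded by "$".
const plainNumbers = invoiceLine.match(/(?<!\$)\b\d+/g);

console.log(dollarAmounts); // [ '10', '3' ]
console.log(plainNumbers);  // [ '20' ]
```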

Lookbehind usually comes with a performance cost, as the regex engine needs more steps to evaluate it.

Pattern modifiers

Patterns can be "configured" with flags to be case-insensitive, to match across newlines, to ignore whitespace in the pattern, ...

This varies a fair bit by language. Here's a reference for Java:

https://www.logicbig.com/tutorials/core-java-tutorial/java-regular-expressions/regex-embedded-flags.html

And for JavaScript:

https://www.codeguage.com/courses/regexp/flags

Note that JavaScript doesn't support the "comment" flag; instead, you can build your regex using standard string concatenation and add comments between the pieces.
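
A rough JavaScript sketch of a few flags, followed by the string-concatenation workaround for comments. All inputs and names are illustrative. The pattern pieces use [0-9] rather than \d because a backslash inside a normal string literal would need to be doubled.

```javascript
// i: case-insensitive matching
const caseInsensitive = /hello/i.test("HeLLo");

// s: lets . match newline characters too
const dotWithoutS = /a.b/.test("a\nb");
const dotWithS = /a.b/s.test("a\nb");

// m: ^ and $ match at the start and end of each line
const perLine = "one\ntwo".match(/^\w+$/gm);

console.log(caseInsensitive, dotWithoutS, dotWithS); // true false true
console.log(perLine); // [ 'one', 'two' ]

// No "comment" flag in JavaScript: build the pattern from commented pieces.
const commentedDateSource =
  "([0-9]{4})" + // year
  "-" +
  "([0-9]{2})" + // month
  "-" +
  "([0-9]{2})";  // day
const commentedDate = new RegExp(commentedDateSource);

const commentedDateMatches = commentedDate.test("2024-03-15");
console.log(commentedDateMatches); // true
```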
