Regular Expressions

aka RegExp or Regex

What are they?

  • They are a way of finding a pattern in a string
  • Use cases:
    • Checking that user input or other data is valid
    • Finding things in a string
  • There are string methods (e.g. includes, startsWith) that do some of these things for basic, fixed cases
  • Those methods exist because RegExps are hard for people to learn BUT, once you get the hang of them, they are very powerful
  • RegExps are useful when you know that there's a pattern to something but you don't know the concrete value

Tools & Guides

Creating one

// /<pattern>/<flags>
const regex1 = /abc/gi;

// new RegExp(<pattern>, <flags>)
const regex2 = new RegExp('abc', 'gi');

// dynamically (you can't interpolate a variable into a
// regex literal, so use the constructor)
const searchTerm = 'cat';
const flags = 'gi';
const regex3 = new RegExp(searchTerm, flags);

(The delimiting slashes are the same direction as JS comments)

Flags

  • A flag is a programming term for an (often boolean) value that adjusts how an operation runs
  • There are several flags in regex:
    • g (global): find every match, not just the first
    • d (indices): results also include the start and end indices of each match and group
    • i ([case] insensitive): ignore case when matching
    • s (single line): enables 'dotall' mode, where . also matches \n
    • m (multi-line): ^ and $ match at the start and end of each line, not just of the whole string
    • u (unicode): enables full Unicode matching (e.g. \u{1F600} and \p{...} escapes)
    • y (sticky): only match starting exactly at lastIndex
      • When you search with the g or y flag, JS records the index just after the last match in lastIndex. If you search again with the same regex it starts from that point, rather than from the beginning.
  • As you can see, the implementation of Regex in JS is not a great one and some of these flags can behave in surprising ways!
  • Flags are read-only once set. To change them, create a new regex from the old one, e.g. new RegExp(re.source, 'gi').
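
As a rough illustration of the flags above, the sketch below shows how g, s and m change the result, and one way to "clone" a regex with an extra flag. The sample strings and variable names are made up for the example.

```javascript
const text = 'Cat, cat, CAT';

// Without g, match returns only the first hit (plus details such as index).
const firstOnly = text.match(/cat/i);
// With g, match returns every hit.
const allHits = text.match(/cat/gi);

// s (dotAll) lets . match a newline as well.
const dotDefault = /a.b/.test('a\nb');
const dotAll = /a.b/s.test('a\nb');

// m makes ^ and $ work per line instead of per string.
const lineStarts = 'one\ntwo'.match(/^\w+/gm);

// Flags are read-only, so build a new regex from the old one's source.
const original = /cat/i;
const withGlobal = new RegExp(original.source, original.flags + 'g');

console.log(firstOnly[0], allHits, dotDefault, dotAll, lineStarts, withGlobal.flags);
```

Building the new regex from source and flags keeps the pattern in one place, which is handy if the original was created elsewhere.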

Inspecting one...

const re = /cat/i;
re.source // 'cat' (the pattern text, without the slashes or flags)
re.lastIndex // 0
re.dotAll // false
re.flags // "i"
re.global // false
re.hasIndices // false
re.ignoreCase // true
re.multiline // false
re.sticky // false
re.unicode // false
  • If you run a regex method on a regex with the g or y flag, it sets lastIndex to the index just after the end of the match. The next search with that regex starts from there, unless you reset lastIndex to 0.
  • Browser support 99% there
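
To make the lastIndex behaviour concrete, here is a small sketch. It assumes a regex with the g flag (and, separately, the y flag); without either flag lastIndex stays at 0. The names are illustrative only.

```javascript
const globalRe = /cat/g;
const str = 'cat cat';

globalRe.test(str);                     // finds the first 'cat' (index 0 to 2)
const afterFirst = globalRe.lastIndex;  // index just past the first match
globalRe.test(str);                     // finds the second 'cat' (index 4 to 6)
const afterSecond = globalRe.lastIndex; // index just past the second match
const third = globalRe.test(str);       // nothing left to find
const afterFail = globalRe.lastIndex;   // a failed search resets lastIndex

// Sticky: the match must start exactly at lastIndex.
const stickyRe = /cat/y;
stickyRe.lastIndex = 4;
const stickyHit = stickyRe.test(str);   // 'cat' starts at index 4
stickyRe.lastIndex = 3;
const stickyMiss = stickyRe.test(str);  // index 3 is a space

console.log(afterFirst, afterSecond, third, afterFail, stickyHit, stickyMiss);
```

This is why reusing one global regex across different strings can give surprising results: reset lastIndex to 0 (or create a fresh regex) before each new string.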

Methods we can use

From 2 perspectives:

  • the RegExp methods:
    • myRegex.exec(str) - runs the expression and returns the match details (or null)
    • myRegex.test(str) - returns true/false depending on whether a match is found
  • the String methods:
    • str.match(regex), str.matchAll(regex), str.search(regex), str.replace(regex, newStr), str.replaceAll(regex, newStr), str.split(regex)
  • The question is: is the string the focus of the code (use String methods), or are you building something you can use repeatedly on multiple strings (use RegExp methods)?
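
A minimal sketch of the two perspectives, using a made-up sentence and a non-global regex so that exec and match return the same kind of result:

```javascript
const catRegex = /cat/i;
const sentence = 'My Cat sat on the mat';

// RegExp perspective: the regex is the subject.
const found = catRegex.test(sentence);   // is there a match at all?
const details = catRegex.exec(sentence); // array with the match, plus index and input

// String perspective: the string is the subject.
const sameDetails = sentence.match(catRegex); // same shape as exec for a non-global regex
const position = sentence.search(catRegex);   // index of the first match, or -1
const words = sentence.split(/\s+/);          // split on runs of whitespace

console.log(found, details[0], details.index, sameDetails[0], position, words);
```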

The next slide shows both ways around. After that, you can presume that anything shown can be done from both the string and the regex perspective.

Searching for a definite term

A regex for a concrete term (e.g. 'cat')

  • NOTE: the 'All' methods (matchAll, replaceAll) require a regex with the global flag!
  • Note the | (OR) symbol: /cat|dog/ matches either 'cat' or 'dog'
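
The two notes above can be sketched like this. The sample string is invented; the point is that matchAll needs the g flag and that | lets one regex match alternatives.

```javascript
const pets = 'cat, dog, Cat, bird';

// matchAll returns an iterator of match objects; spread it to get an array.
const allCats = [...pets.matchAll(/cat/gi)].map((m) => m[0]);

// Calling matchAll with a non-global regex throws a TypeError.
let threwTypeError = false;
try {
  pets.matchAll(/cat/i);
} catch (err) {
  threwTypeError = err instanceof TypeError;
}

// | means OR: match 'cat' or 'dog'.
const catsOrDogs = pets.match(/cat|dog/gi);

console.log(allCats, threwTypeError, catsOrDogs);
```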

Special/Reserved Characters

  • Some text characters have special meaning in Regex
  • If you mean the literal character then you'll have to escape it using a \, like we do for quotes in strings (e.g. 'don\'t')
    • \
    • /
    • [ ]
    • ( )
    • { }
    • ?
    • +
    • *
    • |
    • .
    • ^
    • $
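
A short sketch of escaping in practice. The escapeRegExp helper is hypothetical (it is not a built-in used by these notes); it follows a widely used pattern for making arbitrary text safe to pass to new RegExp.

```javascript
// Escaped: $ and . are treated as literal characters.
const escapedPrice = /\$5\.00/.test('Total: $5.00');
// Unescaped: $ means "end of string" here, so this can never match.
const unescapedPrice = /$5.00/.test('Total: $5.00');

// Hypothetical helper: put a backslash in front of every special character.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\\/]/g, '\\$&');
}

const userInput = 'a.b';
const naive = new RegExp(userInput);              // . still means "any character"
const safe = new RegExp(escapeRegExp(userInput)); // . now means a literal dot

console.log(escapedPrice, unescapedPrice, naive.test('axb'), safe.test('axb'), safe.test('a.b'));
```

Escaping matters most when the pattern comes from user input, as in the dynamic new RegExp(searchTerm, flags) case.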

Anchors

  • There are convenience String methods for the fixed-text case, e.g. startsWith and endsWith
  • You'll recognise ^ and $ from CSS attribute selectors
  • ^ start of the string (with the 'm' flag, also the start of each line)
  • $ end of the string (with the 'm' flag, also the end of each line)
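
A quick sketch of the anchors with and without the m flag, using a made-up two-line string:

```javascript
const poem = 'roses are red\nviolets are blue';

const startsWithRoses = /^roses/.test(poem);      // start of the whole string
const violetsNoFlag = /^violets/.test(poem);      // 'violets' is not at the string start
const violetsMultiline = /^violets/m.test(poem);  // with m, ^ also matches after a newline

const redNoFlag = /red$/.test(poem);              // 'red' is not at the string end
const redMultiline = /red$/m.test(poem);          // with m, $ also matches before a newline

// The fixed-text equivalent, no regex needed.
const plainCheck = poem.startsWith('roses');

console.log(startsWithRoses, violetsNoFlag, violetsMultiline, redNoFlag, redMultiline, plainCheck);
```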

Word Boundaries

  • Like ^ and $ but for words, not lines
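  • \b is a word boundary: the position between a word character (\w) and a non-word character (or the start/end of the string)
'Hello everyone!'
 ^   ^ ^      ^
  • \B is a non-word boundary: any position that is not a word boundary
'Hello everyone!'
  ^^^   ^^^^^^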
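"Hello cat!".match(/\bcat/); // 'cat' at the start of a word
"Hello cat!".match(/cat\b/); // 'cat' at the end of a word
"Hello cat!".match(/\bcat\b/); // 'cat' as a whole word
"Scatter!".match(/\Bcat/); // 'cat' that is not at the start of a word

N.B. \b and \B are defined in terms of \w (ASCII letters, digits and underscore), so they do not work reliably with accented or non-Latin characters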
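
The sketch below checks the boundary behaviour described above, including the N.B. about characters outside the ASCII word set. The accented example is an illustration of that limitation, written with a \u escape so the source file's encoding does not matter.

```javascript
const wholeWord = /\bcat\b/;

const inSentence = wholeWord.test('Hello cat!'); // 'cat' stands alone
const inScatter = wholeWord.test('Scatter!');    // 'cat' is buried inside a word
const insideWord = /\Bcat/.test('Scatter!');     // \B: not at the start of a word
const notInside = /\Bcat/.test('Hello cat!');    // here 'cat' does start a word

// \w does not include 'é' (\u00e9), so \b sees a boundary before it
// and the word is cut short.
const accented = 'caf\u00e9 au lait'.match(/\b\w+\b/g);

console.log(inSentence, inScatter, insideWord, notInside, accented);
```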

Searching for indefinite things

  • Sometimes we don't know the exact values we're looking for
  • Sometimes what we're looking for is a pattern, like an email address or phone number

Sets & Ranges

  • Indicated with [ ]
  • A set matches any one character from a selection
    • [aeiou]
  • A range is as it sounds...
    • [a-z] (any one lowercase letter)
    • [A-Z] (any one uppercase letter)
    • [a-c] (any one of 'a', 'b' or 'c')
    • [0-9] (any one digit)
    • [0-5] (any one digit between 0 and 5, inclusive)
    • [a-zA-Z0-9] (ranges can be combined)
    • [^A-C] (any one character that is NOT A, B or C. Note how ^ is INSIDE the brackets [ ])
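
A few of the sets and ranges above in action, on invented sample strings. Each set matches one character at a time, so the g flag is used to collect every hit.

```javascript
const vowels = 'banana'.match(/[aeiou]/g);      // every vowel
const digits = 'Room 42B'.match(/[0-9]/g);      // every digit
const capitals = 'Room 42B'.match(/[A-Z]/g);    // every uppercase letter
const notAtoC = 'ABCDE'.match(/[^A-C]/g);       // everything except A, B, C
const anyAlnum = /[a-zA-Z0-9]/.test('!?');      // no letters or digits present

console.log(vowels, digits, capitals, notAtoC, anyAlnum);
```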

Meta characters:

  • \d matches any digit ([0-9])
    • \D the reverse ([^0-9])
  • \w any alphanumeric character, plus underscore ([A-Za-z0-9_])
    • \W the reverse ([^A-Za-z0-9_])
  • \s any whitespace character: spaces, tabs, newlines
    • \S the reverse
  • \0 matches the NUL character
  • \n matches a newline character
  • \t matches a tab character
  • \uXXXX matches the character with the 4-digit hex code XXXX (for code points above FFFF use \u{XXXXX}, which requires the u flag)
  • . matches any character that is not a newline char (e.g. \n) (unless you use the s flag)
  • [^] matches any character, including newline characters. It's for use on multi-line strings
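
The shorthand classes above can be tried out like this. The strings are made up, and non-ASCII characters are written as \u escapes so the example does not depend on file encoding.

```javascript
const digitsOnly = 'a1 b2\tc3'.match(/\d/g);      // every digit
const wordChars = 'a_1 b-2'.match(/\w/g);         // letters, digits and underscore
const pieces = 'a b\tc\nd'.split(/\s/);           // split on any whitespace char

const dotNewline = /a.b/.test('a\nb');            // . stops at a newline
const anyChar = /a[^]b/.test('a\nb');             // [^] matches anything

const bmpEscape = /\u00e9/.test('caf\u00e9');     // 4-digit form, no flag needed
const astralEscape = /\u{1F600}/u.test('\u{1F600}'); // braces form needs the u flag

console.log(digitsOnly, wordChars, pieces, dotNewline, anyChar, bmpEscape, astralEscape);
```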

Quantifiers

  • Look at [0-9]$ (the string must end with a digit)
  • As soon as you add anchors to sets and ranges you'll see that a set or range matches just a single character
  • How do we say several [A-Z] characters?
    • + means 1 or more: [A-Z]+
    • * means 0 or more: [A-Z]*
  • Can we make chars or parts of the regex (aka 'atoms') optional?
    • ? means 0 or 1: [A-Z]?
  • Can we give an exact number of characters?
    • [A-Z]{3}
  • or a range for that number?
    • [A-Z]{3,5} (NO whitespace: {3, 5} is not the same as {3,5})
  • or say "at least n characters"?
    • [A-Z]{3,}
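
A sketch of each quantifier, anchored with ^ and $ so the whole string has to fit the pattern. The last two lines show what the stray space does: without the u flag, {3, 5} is read as literal text rather than as a quantifier.

```javascript
const onePlus = /^[A-Z]+$/.test('ABC');         // 1 or more
const onePlusEmpty = /^[A-Z]+$/.test('');       // + needs at least one
const zeroPlusEmpty = /^[A-Z]*$/.test('');      // * accepts none

const optionalU = /^colou?r$/;                  // the 'u' is optional
const bothSpellings = optionalU.test('color') && optionalU.test('colour');

const exactlyThree = /^[A-Z]{3}$/.test('ABCD');     // 4 letters is too many
const threeToFive = /^[A-Z]{3,5}$/.test('ABCD');    // 4 is within 3 to 5
const atLeastThree = /^[A-Z]{3,}$/.test('ABCDEFG'); // no upper limit

const spacedRange = /^[A-Z]{3, 5}$/.test('ABCD');       // not a quantifier any more
const spacedLiteral = /^[A-Z]{3, 5}$/.test('A{3, 5}');  // matches the literal braces

console.log(onePlus, onePlusEmpty, zeroPlusEmpty, bothSpellings,
  exactlyThree, threeToFive, atLeastThree, spacedRange, spacedLiteral);
```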

Groups

  • You can group parts of the regex with ( ) to
    • allow for their
      • repeated use (\d{3}-){2,3} or (\d{3}-)+
      • optional use (\d{3}-)?
    • assign them an id ($1 = first group, $2 = second, etc.)
      • You can also name them: /(?<year>\d{4})/ (ES2018)
    • extract the matches for that particular group
  • By default groups are capturing groups (which extract matches)
    • You can switch that off with ?: like (?:\s)
  • If there are no matches then match and exec return null
  • Otherwise they return an array with the full match, then the group matches in group order
    • If you don't have groups you get just the full match
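
A sketch of capturing, named and non-capturing groups. The phone and date formats are invented purely to have something to group.

```javascript
// Two capturing groups: full match first, then each group in order.
const phoneParts = '555-1234'.match(/(\d{3})-(\d{4})/);

// Named groups are available on the .groups property.
const dateMatch = /(?<year>\d{4})-(?<month>\d{2})/.exec('2024-03');
const year = dateMatch.groups.year;
const month = dateMatch.groups.month;

// A non-capturing group still groups (so {2} applies to it) but is not returned.
const nonCapturing = /(?:\d{3}-){2}\d{4}/.exec('555-123-4567');

// No match at all gives null, so check before indexing.
const noMatch = 'no digits here'.match(/(\d{3})-(\d{4})/);

console.log(phoneParts[0], phoneParts[1], phoneParts[2], year, month, nonCapturing.length, noMatch);
```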

To get the matches, use one of str.match(regex), str.matchAll(regex) or regex.exec(str).

Greed/laziness

  • By default regular expressions are greedy
  • They seek to match/return as much text as possible
  • A quantifier like + (one or more) keeps going as far as it can, so the match runs to the end of the last possible match rather than the first
  • You can switch this behaviour off by putting ? after the quantifier (+?, *?), so that it stops at the first complete match
  • Doing that makes the quantifier lazy
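
Greedy versus lazy is easiest to see on a string with two possible endings. The HTML-like snippet below is just sample text for the illustration.

```javascript
const html = '<b>one</b> and <b>two</b>';

// Greedy: .+ runs to the LAST </b> it can reach.
const greedy = html.match(/<b>.+<\/b>/)[0];

// Lazy: .+? stops at the FIRST </b> that completes the match.
const lazy = html.match(/<b>.+?<\/b>/)[0];

// Lazy plus g picks out each tag pair separately.
const eachTag = html.match(/<b>.+?<\/b>/g);

console.log(greedy, lazy, eachTag);
```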

Replacing

  • We can swap values out of our strings
  • Done with the use of:
    • str.replace(oldStr, newStr)
      • a string oldStr only replaces the first occurrence! A global regex replaces them all
    • str.replaceAll(oldStr, newStr)
  • Instead of a string for oldStr we can pass a regex to give more flexibility
    • N.B. with replaceAll the regex must have the global flag!
    • for newStr we have access to the ids of capture groups and can use them ($1, $2, etc.)
    • or we can specify a function that gets
      • the match
      • the groups (one argument per capture group)
      • the offset of the match from the start of the string
      • the whole string
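
The replacing options above, sketched on small made-up strings. The replacer function's parameter names are arbitrary; only their order matters.

```javascript
const firstOnly = 'cat cat'.replace('cat', 'dog');      // string: first occurrence only
const everyCat = 'cat cat'.replace(/cat/g, 'dog');      // global regex: all of them
const viaReplaceAll = 'cat cat'.replaceAll('cat', 'dog');

// replaceAll with a non-global regex throws a TypeError.
let replaceAllThrew = false;
try {
  'cat cat'.replaceAll(/cat/, 'dog');
} catch (err) {
  replaceAllThrew = err instanceof TypeError;
}

// $1, $2, $3 refer to the capture groups, so they can be reordered.
const reordered = '2024-03-15'.replace(/(\d{4})-(\d{2})-(\d{2})/, '$3/$2/$1');

// Function form: (match, ...groups, offset, wholeString).
const doubled = 'x=2, y=3'.replace(/(\d)/g, (match, digit, offset, whole) => {
  return String(Number(digit) * 2);
});

console.log(firstOnly, everyCat, viaReplaceAll, replaceAllThrew, reordered, doubled);
```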

Look Ahead/Behind

  • These allow you to match something only if it is preceded or followed by something else
    • Lookahead
      • positive X(?=Y) matches X if it is followed by Y
      • negative X(?!Y) matches X if it is not followed by Y
    • Lookbehind
      • positive (?<=Y)X matches X if it is preceded by Y
      • negative (?<!Y)X matches X if it is not preceded by Y
    • Warning: Safari only supports lookbehinds from version 16.4, so check the browsers you need to support
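
A sketch of all four lookarounds on invented price strings. The \b in the negative lookahead and the \d in the negative lookbehind are there to stop the engine from matching just part of a number, a common gotcha with negative lookarounds.

```javascript
const amounts = '100 USD, 200 EUR';

// Positive lookahead: digits followed by ' USD' (the ' USD' is not part of the match).
const usdOnly = amounts.match(/\d+(?= USD)/g);

// Negative lookahead: whole numbers NOT followed by ' USD'.
const notUsd = amounts.match(/\b\d+\b(?! USD)/g);

const prices = 'cost: $100, qty: 200';

// Positive lookbehind: digits preceded by '$'.
const dollars = prices.match(/(?<=\$)\d+/g);

// Negative lookbehind: digits NOT preceded by '$' (or by another digit).
const notDollars = prices.match(/(?<![$\d])\d+/g);

console.log(usdOnly, notUsd, dollars, notDollars);
```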

Danger, Will Robinson!!

  • Some regexes can take a LOOOOONG time to execute
  • Sometimes this is used by hackers as a method of attack (known as ReDoS: regular expression denial of service)
  • Read this for slow queries
  • Read this for security
  • module
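
As a rough sketch of the slow-regex problem: nested quantifiers such as (a+)+ are a commonly cited example of a pattern that can backtrack heavily on input that almost matches. The pattern, the timeMatch helper and the input length are all assumptions chosen for illustration, and the input is kept short on purpose so the example finishes quickly.

```javascript
// Nested quantifier: many different ways to split a run of 'a's between
// the inner and outer +, all of which get tried when the match fails.
const risky = /^(a+)+$/;
// Same set of accepted strings, but only one way to match each one.
const safer = /^a+$/;

// Hypothetical helper: run a test and report roughly how long it took.
function timeMatch(regex, input) {
  const start = Date.now();
  const matched = regex.test(input);
  return { matched, ms: Date.now() - start };
}

// A run of 'a's followed by a character that makes the match fail.
const almostMatching = 'a'.repeat(20) + '!';

const riskyResult = timeMatch(risky, almostMatching);
const saferResult = timeMatch(safer, almostMatching);

console.log('risky:', riskyResult, 'safer:', saferResult);
```

Both patterns give the same answer; the difference is in how much work the engine may do to get there. With the risky pattern each extra 'a' in the input can roughly double that work, so never run patterns like this on untrusted input of unbounded length.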