regex

regular expressions

what is a regex?

A regex matches text that follows a pattern: not a pattern in the sense of a sequence, but in the sense of having something in common. What does each of these pairs of strings have in common?
- test - test

- abc - xyz

- 123 - 456

- 1 - 99999999

- 1 - a

- a23 - =,&
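
One way to check an answer to the quiz above is to write a pattern and see whether it matches both strings of a pair in full. This is a minimal sketch in Python; the patterns are only one possible answer per pair, and the helper name both_match is made up for this example.

```python
import re

# One candidate pattern per pair; other answers are possible.
pairs = [
    ("test", "test", r"test"),      # the same literal text
    ("abc", "xyz", r"[a-z]{3}"),    # three lowercase letters
    ("123", "456", r"[0-9]{3}"),    # three digits
    ("1", "99999999", r"[0-9]+"),   # one or more digits
    ("1", "a", r"."),               # exactly one character
    ("a23", "=,&", r".{3}"),        # exactly three characters
]


def both_match(first, second, pattern):
    """Return True when the pattern matches both strings in full."""
    return all(re.fullmatch(pattern, s) for s in (first, second))


results = [both_match(*pair) for pair in pairs]
print(results)
```

A pattern that is too narrow fails the check: [0-9]+ matches "1" but not "a".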

A regular expression is a sequence of characters that defines a search pattern.

what are regexes useful for?

Purposes of regex

  • Searching for a match (finding stuff in text)
    • Segmentation
    • Validating input
    • QA automation
  • Replacing part of the text
    • Automatic or manual fixes
  • Entity extraction
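
The three purposes above can be sketched in a few lines of Python. The sample sentence and the patterns are invented for illustration; they use shortcuts such as \d (a digit) and \b (a word boundary).

```python
import re

text = "Order 66 shipped on 2024-03-15; order 7 is pending."

# Searching / validating: does the text contain a date like 2024-03-15?
has_date = re.search(r"\d{4}-\d{2}-\d{2}", text) is not None

# Replacing: capitalise every lowercase "order".
fixed = re.sub(r"\border\b", "Order", text)

# Entity extraction: pull out every order number.
order_numbers = re.findall(r"[Oo]rder (\d+)", text)

print(has_date, fixed, order_numbers)
```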

Goals of this training

  • Make you less scared of using a computer to handle text (if you were!)
  • Make you more independent
  • Make you more capable (a bit closer to a power user)
  • Make you more aware of what can be done even if you're not sure how it's done
    • Ask for help

Non-goals

  • Turn you into regex champs

a few concepts and definitions

text

string

character

expression

match

everything is a character

كل شئ حرفٌ (Arabic for "everything is a character")

characters form strings

literal characters

metacharacters

backslash

literal characters

literal matches

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789<>#@%&/,;:-_!

metacharacters

pattern matching

^$|.?*+()[]{}\
!=
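
The difference between a literal match and pattern matching can be shown with a small Python sketch. The sample text is made up; the only metacharacter used is the dot.

```python
import re

text = "cat cot c.t cut"

# Literal characters match themselves, one after the other.
literal_hits = re.findall(r"cat", text)

# The dot is a metacharacter: here it stands for any one character.
pattern_hits = re.findall(r"c.t", text)

print(literal_hits)   # ['cat']
print(pattern_hits)   # ['cat', 'cot', 'c.t', 'cut']
```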

backslash

escaping

The backslash can be used to:

  • escape metacharacters so that they match literally: ? is a metacharacter, \? is a literal question mark
  • assign a special meaning to some letters, e.g. \n (a line break)
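
Both uses of the backslash can be tried out in Python. This is a sketch with an invented sample string; the patterns are written as raw strings (r"...") so that Python passes the backslash on to the regex engine unchanged.

```python
import re

text = "Ready? Go!\nReally?"

# Unescaped ?: a metacharacter, so "y?" means "an optional y".
optional_y = re.findall(r"Ready?", text)

# Escaped \?: a literal question mark.
question_marks = re.findall(r"\?", text)

# \n: the backslash gives the letter n a special meaning (line break).
lines = re.split(r"\n", text)

print(optional_y, question_marks, lines)
```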

Exercise 1

two exercises

  • find an asterisk in the text
  • find the literal text "\n" (a backslash followed by the letter n)
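
One possible solution to the two exercises, sketched in Python with a made-up sample. The sample is a raw string, so it contains a real backslash followed by n and no line break.

```python
import re

# Raw string: the two characters backslash + n, not a line break.
sample = r"Note*: write \n for a line break"

# The asterisk is a metacharacter, so it must be escaped.
asterisks = re.findall(r"\*", sample)

# The backslash is a metacharacter too: escape it, then add a plain n.
literal_backslash_n = re.findall(r"\\n", sample)

print(asterisks, literal_backslash_n)
```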

 

let's do some quizzes

examples

the dot

  • Matches any single character
    • including the line break in some tools, or when a "dot matches line breaks" option is switched on
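
How the dot treats the line break depends on the tool. In Python, for instance, the dot skips line breaks unless the re.DOTALL flag is given; other tools may behave differently. A short sketch with an invented sample:

```python
import re

text = "a-c a\nc"

# Default in Python: the dot matches anything except a line break.
default_hits = re.findall(r"a.c", text)

# With re.DOTALL the dot matches the line break as well.
dotall_hits = re.findall(r"a.c", text, re.DOTALL)

print(default_hits)   # ['a-c']
print(dotall_hits)    # ['a-c', 'a\nc']
```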

character classes

  • A class is delimited by square brackets
  • A class is a set of characters, e.g.
    • [aeiou] = the class of lowercase Latin vowels
    • [biu] = b, i and u
  • We can define ranges in a class using the dash:
    • [0-5] = digits 0, 1, 2, 3, 4 and 5
    • [x-z] = [xyz]
    • [a-z0-9] = all lowercase Latin letters and digits
  • Most metacharacters become literal inside a class
    • except ], \, - and ^
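
The class rules above can be checked with a few one-liners. This is a Python sketch; the sample strings are invented, and the last pattern escapes ] and - with a backslash to make them literal inside the class.

```python
import re

# A set of characters.
vowels = re.findall(r"[aeiou]", "unique")

# Ranges with the dash.
low_digits = re.findall(r"[0-5]", "0123456789")
last_letters = re.findall(r"[x-z]", "wxyz")

# Metacharacters such as . ? * are literal inside a class.
literal_meta = re.findall(r"[.?*]", "a.b?c*d")

# ] and - need a backslash here; ^ is literal when it is not first.
special_in_class = re.findall(r"[\]\-^]", "a]b-c^d")

print(vowels, low_digits, last_letters, literal_meta, special_in_class)
```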

class names

  • Some sets of characters have a predefined name, written [:name:] inside a class, e.g. [[:digit:]] (POSIX syntax; not every regex tool supports it)
    • [:alpha:] = alphabetic characters: [A-Za-z]
    • [:upper:] = uppercase alphabetic characters: [A-Z]
    • [:lower:] = lowercase alphabetic characters: [a-z]
    • [:digit:] = digits: [0-9]
    • [:alnum:] = alphanumeric characters: [A-Za-z0-9]
    • [:punct:] = punctuation marks: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    • [:space:] = whitespace: space, \t, \r, \n, form feed, vertical tab
    • [:blank:] = space and tab characters
    • [:graph:] = visible characters (alnum + punct)
    • [:print:] = printable characters (graph + the space character)
    • [:xdigit:] = hexadecimal characters: [0-9a-fA-F]
    • [:cntrl:] = control characters
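
Python's re module does not understand the [:name:] syntax, so the sketch below spells out ASCII equivalents for a few of the names by hand. The table ASCII_EQUIVALENTS and the helper named_class are hypothetical names invented for this example, not part of any library.

```python
import re
import string

# Hypothetical lookup table: ASCII equivalents for some class names.
ASCII_EQUIVALENTS = {
    "alpha": "A-Za-z",
    "upper": "A-Z",
    "lower": "a-z",
    "digit": "0-9",
    "alnum": "A-Za-z0-9",
    "xdigit": "0-9a-fA-F",
    # string.punctuation holds the same 32 marks as the punct list;
    # re.escape makes them safe to place inside square brackets.
    "punct": re.escape(string.punctuation),
}


def named_class(name):
    """Compile a class equivalent to [[:name:]] for ASCII text."""
    return re.compile("[" + ASCII_EQUIVALENTS[name] + "]")


punct_hits = named_class("punct").findall("Hi, there! (ok)")
hex_hits = named_class("xdigit").findall("0xBEEF?")

print(punct_hits, hex_hits)
```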

character class shortcuts

quantifiers

  • {n,m} = between n and m times
  • ? = zero or one time (optional)
  • + = one or more times
  • * = zero or more times

 

exercise

  • find figures: [0-9]+
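
The quantifiers and the figure-finding pattern from the exercise can be tried out as follows. A minimal Python sketch; all sample strings are invented.

```python
import re

# + : one or more digits in a row, i.e. whole figures.
text = "In 2023 we sold 15 units, 7 more than in 2022."
figures = re.findall(r"[0-9]+", text)

# ? : the u is optional.
optional = re.findall(r"colou?r", "color colour colouur")

# * : zero or more o's; + : at least one o.
zero_or_more = re.findall(r"go*d", "gd god good")
one_or_more = re.findall(r"go+d", "gd god good")

# {n,m} : between 2 and 3 digits at a time.
bounded = re.findall(r"[0-9]{2,3}", "1 22 333 4444")

print(figures, optional, zero_or_more, one_or_more, bounded)
```

Note the last result: 4444 yields 444 because {2,3} takes at most three digits, and the remaining single 4 is too short to match.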

negation

  • [^...] = negated class: any character that is not listed
  • negative lookarounds: (?!...) and (?<!...)
  • uppercase shortcuts \S, \D, \W = the opposites of \s, \d, \w
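
The three kinds of negation can be compared on one sample. This is a Python sketch with an invented string; the lookaround shown is a negative lookahead, (?!...).

```python
import re

text = "Room 12B, floor 3"

# Negated class: runs of anything except digits, spaces and commas.
not_listed = re.findall(r"[^0-9 ,]+", text)

# Uppercase shortcut: \D is any character that is not a digit.
non_digit_runs = re.findall(r"\D+", text)

# Negative lookahead: figures NOT followed by a digit or capital letter.
plain_figures = re.findall(r"[0-9]+(?![0-9A-Z])", text)

print(not_listed, non_digit_runs, plain_figures)
```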

exercise

  • find text between parentheses
  • find text between angle brackets
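
One possible solution uses a negated class: after the opening bracket, take everything that is not the closing bracket. A Python sketch with an invented sample; the inner parentheses form a group, so findall returns only the text inside the brackets.

```python
import re

text = "Press <Enter> (or <Return>) to continue (optional)."

# \( and \) are literal parentheses; [^)]* is "anything but )".
in_parentheses = re.findall(r"\(([^)]*)\)", text)

# Angle brackets are literal characters and need no escaping.
in_angle_brackets = re.findall(r"<([^>]*)>", text)

print(in_parentheses)      # ['or <Return>', 'optional']
print(in_angle_brackets)   # ['Enter', 'Return']
```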

alternatives

Exercise (after \d and |)

  • match michael, Michael, Mike and mike
  • match addresses: digits, then words, then Road or Street or an abbreviation
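
A possible answer to both exercises, sketched in Python. The address format (a number, capitalised words, then Road, Street, Rd or St) is an assumption made for this example; real addresses vary far more. The (?:...) construct groups alternatives without capturing them.

```python
import re

# [Mm] covers the capital, the alternation covers the two endings.
names_text = "michael, Michael, Mike, mike, Mick"
names = re.findall(r"[Mm](?:ichael|ike)", names_text)

# Assumed format: number, capitalised words, then the street type.
ADDRESS = r"\d+ (?:[A-Z][a-z]+ )+(?:Road|Street|Rd\.?|St\.?)"
address_text = "Send it to 12 Baker Street or 7 Long Acre Rd. today"
addresses = re.findall(ADDRESS, address_text)

print(names)
print(addresses)
```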

group constructs

anchors

metasequences

replacements

lookarounds

exercise

  • remove the angle brackets
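
Two possible ways to do this exercise in Python, on an invented sample. The first keeps the text between the brackets by capturing it in a group and putting it back with \1; the second simply deletes every angle bracket.

```python
import re

text = "Press <Enter> or <Esc>"

# Capture what is between the brackets and keep only that part.
with_group = re.sub(r"<([^>]*)>", r"\1", text)

# Simpler: replace any single angle bracket with nothing.
with_class = re.sub(r"[<>]", "", text)

print(with_group)   # Press Enter or Esc
print(with_class)   # Press Enter or Esc
```

The two differ on unpaired brackets: in "a < b" the first leaves the text alone, the second removes the bracket.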

Unicode blocks, example

\p{InArabic}\(

find an opening parenthesis immediately preceded by an Arabic character (the match includes that character)
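
Python's re module does not support \p{...}, so this sketch approximates the block with an explicit range, U+0600 to U+06FF (the Arabic block). The sample text is invented; the second pattern uses a lookbehind so that only the parenthesis itself is matched.

```python
import re

# Approximation of \p{InArabic}: the Arabic block, U+0600 to U+06FF.
ARABIC = r"[\u0600-\u06FF]"

# An Arabic word directly followed by "(", then a Latin word + " (".
text = "\u0643\u062a\u0627\u0628(book) and kitab (book)"

# Same idea as \p{InArabic}\( : Arabic character plus parenthesis.
char_and_paren = re.findall(ARABIC + r"\(", text)

# Lookbehind variant: match the parenthesis only.
paren_only = re.findall(r"(?<=" + ARABIC + r")\(", text)

print(char_and_paren, paren_only)
```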

flags

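case sensitive

By default, A and a are different characters (different code points). A case-insensitive flag makes them match each other.

global

Find every match in the text, not only the first one.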
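
How flags are written depends on the tool. In Python, case-insensitivity is the re.IGNORECASE flag, and there is no global flag as such: re.search stops at the first match while re.findall returns all of them. A short sketch with an invented sample:

```python
import re

text = "Apple, apple, APPLE"

# A and a really are different characters.
code_points = (ord("A"), ord("a"))

# Case sensitive (the default): only the exact spelling matches.
case_sensitive = re.findall(r"apple", text)

# Case insensitive: all three spellings match.
case_insensitive = re.findall(r"apple", text, re.IGNORECASE)

# First match only, as opposed to a "global" search.
first_only = re.search(r"apple", text, re.IGNORECASE).group()

print(code_points, case_sensitive, case_insensitive, first_only)
```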

Named groups

(?<token>[\d]*)

(?P<year>(?:19|20)\d\d)(?P<delimiter>[- /.])(?P<month>0[1-9]|1[012])\2(?P<day>0[1-9]|[12][0-9]|3[01])
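
The date pattern above uses the (?P<name>...) syntax, which Python accepts as written. In this sketch the sample sentences are invented. The \2 in the pattern refers back to the second capturing group, the delimiter, so both delimiters in a date must be the same.

```python
import re

DATE = re.compile(
    r"(?P<year>(?:19|20)\d\d)(?P<delimiter>[- /.])"
    r"(?P<month>0[1-9]|1[012])\2(?P<day>0[1-9]|[12][0-9]|3[01])"
)

match = DATE.search("Released on 2024-03-15.")

# Each named group can be read by its name.
parts = match.groupdict()

# Mixed delimiters do not match because of the \2 backreference.
mixed = DATE.search("Released on 2024-03/15.")

print(parts)
print(mixed)
```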

Exercises

  • [Qq]uestionnaire.*\.docx?$ = file names that contain Questionnaire or questionnaire and end in .doc or .docx (to do: add a list of files to the text sample)
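
The file-name pattern can be tested against a small list. The file names below are hypothetical, made up for this sketch.

```python
import re

PATTERN = re.compile(r"[Qq]uestionnaire.*\.docx?$")

# Hypothetical file names, invented for this example.
file_names = [
    "Questionnaire_v2.docx",
    "final questionnaire (signed).doc",
    "Questionnaire.pdf",
    "questionnaire.docx.bak",
    "notes.docx",
]

# search (not fullmatch): the name may start with something else.
matching = [name for name in file_names if PATTERN.search(name)]

print(matching)
```

The $ anchor is what rejects questionnaire.docx.bak: the extension must come at the very end of the name.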

References

if all that was unfathomable...

here's one last technique:

do not forget to cover examples sent by Valentina in a Word file (same as found in OmegaT's regex page)

Technical

  • https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
  • https://www.regular-expressions.info/quickstart.html

General public

  • https://www.theguardian.com/technology/2012/dec/04/ict-teach-kids-regular-expressions
  • https://regexcrossword.com/ <- want to play?