regex

regular expressions

what is a regex?

regular expression

A sequence of characters that defines a search pattern.

Matches text that follows a pattern: not a pattern in the sense of a sequence, but in the sense of having something in common. What do these pairs of strings have in common?
- test - test

- abc - xyz

- 123 - 456

- 1 - 99999999

- 1 - a

- a23 - =,&
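
One possible reading of the pairs above, sketched in Python. The patterns are guesses at what each pair has in common; other answers are just as valid, and the helper name `matches_both` is made up for this illustration.

```python
import re

# Each entry: (pattern, first string, second string).
# The patterns are one possible answer per pair, not the only one.
pairs = [
    (r"test", "test", "test"),       # the same literal text
    (r"[a-z]{3}", "abc", "xyz"),     # three lowercase letters
    (r"\d{3}", "123", "456"),        # three digits
    (r"\d+", "1", "99999999"),       # one or more digits
    (r"\w", "1", "a"),               # a single word character
    (r".{3}", "a23", "=,&"),         # any three characters
]


def matches_both(pattern, first, second):
    """Return True when the pattern matches each string in full."""
    return all(re.fullmatch(pattern, s) is not None for s in (first, second))


for pattern, first, second in pairs:
    print(pattern, first, second, matches_both(pattern, first, second))
```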

what are regexes useful for?

Purposes of regex

  • Searching for a match (finding stuff in text)
    • Segmentation
    • Validating input
    • QA automation
  • Replacing part of the text
    • Automatic or manual fixes
  • Entity extraction
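
A small Python sketch of these purposes, assuming an invented sample sentence. It shows one way to search, validate, replace and extract, not a recommended workflow.

```python
import re

# Invented sample text for illustration only.
text = "Order 66 shipped on 2021-03-04, order 67 on 2021-03-09."

# Searching: is there a date-like string anywhere in the text?
found = re.search(r"\d{4}-\d{2}-\d{2}", text) is not None

# Validating: does the whole input look like a date?
valid = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2021-03-04") is not None

# Replacing: fix the capitalisation of the lowercase "order".
fixed = re.sub(r"\border\b", "Order", text)

# Extracting: pull out every order number.
orders = re.findall(r"[Oo]rder (\d+)", text)

print(found, valid, fixed, orders)
```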

Goals of this training

  • Make you less scared of using a computer to handle text (if you were!)
  • Make you more independent
  • Make you more capable (a bit closer to a power user)
  • Make you more aware of what can be done even if you're not sure how it's done
    • Ask for help

Non-goals

  • Turn you into regex champs

everything is a character

characters form strings

literal characters: literal matches

metacharacters: pattern matching

backslash: escaping

two exercises

  • find asterisk in text
  • find text "\n"
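
A possible solution to both exercises in Python, assuming a made-up sample sentence. The backslash turns the metacharacter `*` into a literal asterisk, and a doubled backslash matches a literal backslash.

```python
import re

# Raw string: the sample contains a backslash followed by "n",
# not an actual line break.
text = r"Use * for bold and write \n for a new line."

# Exercise 1: "*" is a metacharacter, so it is escaped to match it literally.
asterisk = re.search(r"\*", text)

# Exercise 2: "\\" matches one literal backslash, followed by a literal "n".
backslash_n = re.findall(r"\\n", text)

# Without the extra backslash, the pattern means "line break" instead,
# and the sample has none.
line_break = re.search(r"\n", text)

# re.escape adds the backslashes for you.
escaped = re.escape("*")

print(asterisk.start(), backslash_n, line_break, escaped)
```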

 

examples

the dot

  • Matches any character
    • whether that includes the line break depends on the engine and its flags
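
A quick check in Python, where the dot skips line breaks unless the `re.DOTALL` flag is set. Other engines may use a different flag name or a different default.

```python
import re

text = "ab\ncd"

# Default: the dot matches every character except the line break.
default = re.findall(r".", text)

# With DOTALL, the dot matches the line break too.
dotall = re.findall(r".", text, flags=re.DOTALL)

print(default, dotall)
```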

character classes

character class shortcuts
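
A short Python sketch of character classes and their shortcuts, run on an invented sample string. A class in square brackets matches one character out of a set; `\d`, `\w` and `\s` are shortcuts for digits, word characters and whitespace.

```python
import re

# Invented sample text for illustration only.
text = "Room 42, floor B-7"

# A character class: any one of the listed characters.
vowels = re.findall(r"[aeiou]", text)

# A range inside a class, and its shortcut.
digits_class = re.findall(r"[0-9]", text)
digits_shortcut = re.findall(r"\d", text)

# \w matches letters, digits and underscore; \s matches whitespace.
words = re.findall(r"\w+", text)
spaces = re.findall(r"\s", text)

print(vowels, digits_class, digits_shortcut, words, spaces)
```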

quantifiers

  • {n,m}: between n and m repetitions
  • ?: zero or one
  • +: one or more
  • *: zero or more

 

exercise

  • find figures: [0-9]+
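
One way to try the exercise and the other quantifiers in Python, using a made-up sentence.

```python
import re

# Invented sample text for illustration only.
text = "In 2023 we sold 15 units, 7 more than in 2022."

# +  : one or more digits in a row (the exercise pattern).
figures = re.findall(r"[0-9]+", text)

# {4}: exactly four digits, bounded by word boundaries.
years = re.findall(r"\b[0-9]{4}\b", text)

# {1,2}: one or two digits, bounded by word boundaries.
small = re.findall(r"\b[0-9]{1,2}\b", text)

# ?  : the "u" is optional (zero or one).
spellings = re.findall(r"colou?r", "color colour")

print(figures, years, small, spellings)
```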

negation

  • [^...]
  • negative lookaround
  • uppercase shortcuts: \S, \D, \W (the opposites of \s, \d, \w)

exercise

  • find text between parentheses
  • find text between angle brackets
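
A possible solution in Python, assuming an invented sample sentence. The negated class `[^)]` reads as "any character that is not a closing parenthesis", which keeps each match from running past the first closing delimiter.

```python
import re

# Invented sample text for illustration only.
text = "Click <b>Save</b> (or press Ctrl+S) to keep <i>all</i> changes (if any)."

# Text between parentheses: "(", then anything that is not ")", then ")".
in_parens = re.findall(r"\(([^)]*)\)", text)

# Text between angle brackets: "<", then anything that is not ">", then ">".
in_angles = re.findall(r"<([^>]*)>", text)

# Uppercase shortcut: \D is "not a digit", so this strips everything else.
digits_only = re.sub(r"\D", "", "+34 600-123-456")

print(in_parens, in_angles, digits_only)
```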

alternatives

exercise (uses \d and |)

  • match michael, Michael, Mike, mike
  • match addresses: digits, then words, then Road or Street or an abbreviation
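
A sketch of both exercises in Python. The sample sentences and the abbreviations `Rd` and `St` are assumptions made for this illustration, and the address pattern is deliberately simple: real addresses vary far more than this.

```python
import re

# Alternatives with "|", grouped so the word boundaries apply to both.
name_pattern = r"\b(?:[Mm]ichael|[Mm]ike)\b"
names = re.findall(name_pattern, "michael met Michael, Mike and mike, but not Michelle.")

# Digits, one or more capitalised words, then a street type.
# "Rd" and "St" are assumed abbreviations, with an optional full stop.
address_pattern = r"\b\d+(?: [A-Z][a-z]+)+ (?:Road|Street|Rd\.?|St\.?)"
addresses = re.findall(
    address_pattern, "Send it to 12 Baker Street or 221 Abbey Rd. today."
)

print(names, addresses)
```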

group constructs

anchors

metasequences

replacements

lookarounds

exercise

  • remove the angle brackets
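
Two ways to approach the exercise in Python, on an invented sample. The first uses a capturing group and puts it back in the replacement; the second uses lookarounds so that the brackets are checked but not included in the match.

```python
import re

# Invented sample text for illustration only.
text = "see <chapter 2> and <appendix>"

# Replacement with a group: \1 in the replacement is whatever group 1 captured.
without_brackets = re.sub(r"<([^>]*)>", r"\1", text)

# Lookarounds: "preceded by <" and "followed by >", without consuming them.
inner_text = re.findall(r"(?<=<)[^>]*(?=>)", text)

print(without_brackets, inner_text)
```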

Blocks, example

\p{InArabic}\(

find an opening parenthesis preceded by an Arabic character
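
Python's built-in `re` module does not support `\p{...}`, so this sketch swaps in the explicit code point range of the Arabic block (U+0600 to U+06FF). The sample string is invented. The second pattern uses a lookbehind so that only the parenthesis itself is matched.

```python
import re

# Invented sample: an Arabic word and an English word, each followed by "(".
# The Arabic word is written with escapes to keep the source left-to-right.
text = "\u0643\u062a\u0627\u0628(book) and word(word)"

# Stand-in for \p{InArabic}: the Arabic block as an explicit range.
arabic = r"[\u0600-\u06FF]"

# Arabic character followed by "(": the match includes that character.
with_char = re.findall(arabic + r"\(", text)

# Lookbehind: only the "(" is matched, the Arabic character is just checked.
paren_only = [m.start() for m in re.finditer(r"(?<=" + arabic + r")\(", text)]

print(with_char, paren_only)
```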

flags

case sensitive

e.g. A and a are different characters! (different code points)

global
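
A sketch of both flags in Python. Note the assumption: Python has no global flag as such, so the closest equivalent shown here is the choice between `re.search` (first match only) and `re.findall` (all matches).

```python
import re

text = "Regex, regex, REGEX"

# "A" and "a" really are different characters.
code_points = (ord("A"), ord("a"))

# Case sensitive by default: only the exact lowercase form matches.
sensitive = re.findall(r"regex", text)

# Case insensitive flag: all three forms match.
insensitive = re.findall(r"regex", text, flags=re.IGNORECASE)

# "Non-global" behaviour: stop at the first match.
first_only = re.search(r"regex", text, flags=re.IGNORECASE).group()

print(code_points, sensitive, insensitive, first_only)
```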

Named groups

(?<token>[\d]*)

(?P<year>(?:19|20)\d\d)(?P<delimiter>[- /.])(?P<month>0[1-9]|1[012])\2(?P<day>0[1-9]|[12][0-9]|3[01])
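
The date pattern above can be tried as is in Python, which writes named groups as `(?P<name>...)`; the `(?<token>...)` form belongs to other engines such as Java or JavaScript. The sample dates are invented. `\2` refers back to the second capturing group, the delimiter, so both delimiters must be the same.

```python
import re

date_pattern = re.compile(
    r"(?P<year>(?:19|20)\d\d)(?P<delimiter>[- /.])"
    r"(?P<month>0[1-9]|1[012])\2(?P<day>0[1-9]|[12][0-9]|3[01])"
)

# Same delimiter twice: matches, and each part is available by name.
match = date_pattern.fullmatch("2024-02-29")
parts = match.groupdict()

# Mixed delimiters: \2 fails, so there is no match.
mixed = date_pattern.fullmatch("2024-02/29")

print(parts, mixed)
```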

Exercises

  • [Qq]uestionnaire.*\.docx?$ (try it on a list of file names)
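
A sketch of the file name exercise in Python. The file names are hypothetical, made up to show which ones the pattern keeps and which it rejects.

```python
import re

# Hypothetical file names for illustration only.
files = [
    "questionnaire_v2.docx",
    "Questionnaire final.doc",
    "questionnaire.docx.bak",
    "Student Questionnaire.pdf",
    "survey.docx",
]

# Q or q, "uestionnaire", anything, then ".doc" or ".docx" at the very end.
pattern = re.compile(r"[Qq]uestionnaire.*\.docx?$")

hits = [name for name in files if pattern.search(name)]

print(hits)
```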

References

if all that was unfathomable...

here's one last technique:

do not forget to cover examples sent by Valentina in a Word file (same as found in OmegaT's regex page)

Technical

  • https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
  • https://www.regular-expressions.info/quickstart.html

General public

  • https://www.theguardian.com/technology/2012/dec/04/ict-teach-kids-regular-expressions
  • https://regexcrossword.com/ <- want to play?

regex_blue

By msoutopico
