regex

regular
expressions
reg
ex
what is a regex?
A regular expression is a sequence of characters that defines a search pattern.
Matches text that follows a pattern. Not a pattern in the sense of a sequence, but in the sense of having something in common. What does each of these pairs of strings have in common?
- test - test
- abc - xyz
- 123 - 456
- 1 - 99999999
- 1 - a
- a23 - =,&
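One possible reading of the pairs above, sketched with Python's `re` module. Each pattern is only a guess at what the two strings have in common (other answers are just as valid), and the variable names are made up for illustration.

```python
import re

# Each entry: (pair of strings, one pattern that both strings match in full).
# The patterns are one interpretation of "what the pair has in common".
pairs = [
    (("test", "test"), r"test"),      # the same literal text
    (("abc", "xyz"), r"[a-z]{3}"),    # three lowercase letters
    (("123", "456"), r"[0-9]{3}"),    # three digits
    (("1", "99999999"), r"[0-9]+"),   # one or more digits
    (("1", "a"), r"[a-z0-9]"),        # a single lowercase letter or digit
    (("a23", "=,&"), r".{3}"),        # any three characters
]

# True for a pair when both of its strings match the pattern completely.
results = [
    all(re.fullmatch(pattern, s) is not None for s in strings)
    for strings, pattern in pairs
]
print(results)
```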


what are regexes useful for?


Purposes of regex
- Searching for a match (finding stuff in text)
- Segmentation
- Validating input
- QA automation
- Replacing part of the text
- Automatic or manual fixes
- Entity extraction
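A minimal sketch of a few of these purposes using Python's `re` module. The sample text and the patterns are invented for illustration; real tasks would need patterns tuned to the actual data.

```python
import re

text = "Order 66 shipped on 2024-01-15; order 67 is pending."

# Searching: is there a date-like substring anywhere in the text?
found = re.search(r"\d{4}-\d{2}-\d{2}", text)

# Validating input: does the whole string look like a date?
is_valid = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2024-01-15") is not None

# Replacing part of the text: mask every run of digits.
masked = re.sub(r"\d+", "#", text)

# Entity extraction: collect the order numbers.
orders = re.findall(r"[Oo]rder (\d+)", text)

# Segmentation: split the text into clauses at the semicolon.
segments = re.split(r";\s*", text)

print(found.group(), is_valid, masked, orders, segments, sep="\n")
```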
Goals of this training
- Make you less scared of using a computer to handle text (if you were!)
- Make you more independent
- Make you more capable (a bit closer to a power user)
- Make you more aware of what can be done even if you're not sure how it's done
- Make you comfortable asking for help
Non-goals
- Turn you into regex champs
a few concepts
and definitions
text, string, character
expression ⇒ match
everything is a character
كل شئ حرفٌ
characters form strings
literal characters
metacharacters
backslash
literal characters
literal matches
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789<>#@%&/,;:-_!
metacharacters
pattern matching
^$|.?*+()[]{}\
!=
backslash
escaping
The backslash can be used to:
- escape metacharacters: ? vs. \?
- assign a special meaning to some letters, e.g. \n
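The two roles of the backslash can be sketched in Python as follows. The sample strings are made up, and raw strings (`r"..."`) are used so that Python itself leaves the backslashes alone.

```python
import re

# Unescaped, "?" is a quantifier: in "colou?r" the "u" is optional.
optional_u = re.findall(r"colou?r", "color colour")

# Escaped, "\?" is a literal question mark.
question_marks = re.findall(r"\?", "Really? Yes. Why?")

# The backslash also gives some letters a special meaning:
# "\n" in a pattern matches a line break, not the letter "n".
line_breaks = re.findall(r"\n", "first line\nsecond line\n")

print(optional_u, question_marks, len(line_breaks))
```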
Exercise 1
two exercises
- find asterisk in text
- find text "\n"
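One way to approach the two exercises, sketched in Python. The sample text is hypothetical; note that it contains a backslash followed by the letter n (two characters), not a line break.

```python
import re

# "\\n" in a normal Python string is a backslash plus "n", not a line break.
sample = "Fields marked * are required.\\nSee the note below."

# The asterisk is a metacharacter, so it is escaped to match it literally.
asterisks = re.findall(r"\*", sample)

# To find the text \n, the backslash itself is escaped: \\ then n.
literal_backslash_n = re.findall(r"\\n", sample)

# For comparison: the pattern \n looks for a real line break and finds none here.
real_line_breaks = re.findall(r"\n", sample)

print(asterisks, literal_backslash_n, real_line_breaks)
```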
let's do some quizzes


examples
the dot

- Matches any character
- including the line break, maybe! (by default it usually does not; a "dot matches all" flag or mode switches that on)
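The "maybe" depends on the engine and its flags. As an illustration, in Python's `re` module the dot skips line breaks unless the `re.DOTALL` flag is set; other tools may behave differently.

```python
import re

text = "a\nb"

# Default: the dot matches every character except the line break.
default_dot = re.findall(r".", text)

# With re.DOTALL the dot matches the line break as well.
dotall_dot = re.findall(r".", text, re.DOTALL)

print(default_dot, dotall_dot)
```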
character classes
- A class is delimited by square brackets
- A class is a set of characters, e.g.
- [aeiou] = the class of Latin vowels
- [biu] = b, i and u
- We can define ranges in a class using the dash:
- [0-5] = digits 0, 1, 2, 3, 4 and 5
- [x-z] = [xyz]
- [a-z0-9] = all lowercase Latin letters and digits
- Most metacharacters become literal inside a class
- except... ], \, - and ^ (depending on where they appear in the class)
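A short sketch of sets, ranges and literal metacharacters inside a class, using Python's `re` module. The sample strings are invented for illustration.

```python
import re

text = "big bug, 2019-05 [x^y]"

# A set of characters: any one of the listed vowels.
vowels = re.findall(r"[aeiou]", text)

# A range: any digit from 0 to 5.
low_digits = re.findall(r"[0-5]", text)

# Inside a class, "." and "+" are plain characters.
dot_or_plus = re.findall(r"[.+]", "1.5+2")

# "]" and "-" are escaped here to be taken literally;
# "^" is literal because it is not the first character of the class.
specials = re.findall(r"[\]\-^]", text)

print(vowels, low_digits, dot_or_plus, specials)
```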
class names
- Some sets of characters have a pre-defined name, using syntax [:class:] inside a class, e.g. [[:digit:]]
- [:alpha:] = [A-Za-z]
- [:upper:] = uppercase alphabetic characters: [A-Z]
- [:lower:] = lowercase alphabetic characters: [a-z]
- [:digit:] = [0-9]
- [:alnum:] = [A-Za-z0-9]
- [:punct:] = punctuation marks: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
- [:space:] = whitespace: space, \t, \r, \n, form feed, vertical tab
- [:blank:] = space and tab characters
- [:graph:] = visible characters (alnum + punct)
- [:print:] = printable characters (graph + the space character)
- [:xdigit:] = Hexadecimal characters: [0-9a-fA-F]
- [:cntrl:] = control characters
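Not every engine supports these names. Python's built-in `re` module, for instance, does not, so the sketch below spells a few of them out as explicit ASCII classes. The `POSIX_EQUIVALENTS` table and the `posix_class` helper are hypothetical names made up for this example, and the equivalents only hold for plain ASCII text.

```python
import re

# Explicit ASCII equivalents of a few POSIX class names (illustrative table).
POSIX_EQUIVALENTS = {
    "alpha": "A-Za-z",
    "upper": "A-Z",
    "lower": "a-z",
    "digit": "0-9",
    "alnum": "A-Za-z0-9",
    "xdigit": "0-9a-fA-F",
    "blank": " \\t",  # a space and the regex escape for a tab
}

def posix_class(name: str) -> str:
    """Return a class equivalent to [[:name:]] for ASCII text."""
    return "[" + POSIX_EQUIVALENTS[name] + "]"

# Runs of hexadecimal characters, as [[:xdigit:]]+ would find them.
hex_runs = re.findall(posix_class("xdigit") + "+", "#FF8800 / #00ff7f")

# Single uppercase letters, as [[:upper:]] would find them.
upper = re.findall(posix_class("upper"), "Regex Training")

print(hex_runs, upper)
```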
character class shortcuts
quantifiers
- {n,m} = between n and m repetitions
- ? = zero or one
- + = one or more
- * = zero or more
exercise
- find figures: [0-9]+
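The exercise pattern and the other quantifiers can be tried out as in this Python sketch. A quantifier applies to whatever comes right before it; the sample strings are invented.

```python
import re

text = "Room 7, floor 12, year 2024, zip 90210"

# +  : one or more digits in a row.
figures = re.findall(r"[0-9]+", text)

# {n,m} : between 2 and 4 digits, and nothing else.
two_to_four = re.fullmatch(r"[0-9]{2,4}", "2024") is not None
too_long = re.fullmatch(r"[0-9]{2,4}", "90210") is not None

# ?  : the "s" may appear zero or one time.
optional_s = re.findall(r"files?", "file files")

# *  : the "b" may appear zero or more times.
zero_or_more = re.findall(r"ab*", "a ab abbb")

print(figures, two_to_four, too_long, optional_s, zero_or_more)
```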
negation
- [^...]
- negative lookaround
- uppercase shortcuts \S, \D, \W (the negations of \s, \d, \w)
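The three kinds of negation can be compared side by side in Python. This is only a sketch with made-up sample text; the lookahead shown is one form of negative lookaround.

```python
import re

text = "ID 42: ok!"

# Negated class: runs of anything except digits.
not_digits = re.findall(r"[^0-9]+", text)

# Uppercase shortcuts negate their lowercase counterparts.
non_digit = re.findall(r"\D+", text)   # same result as [^0-9]+ on this text
non_space = re.findall(r"\S+", text)   # runs of non-whitespace
non_word = re.findall(r"\W", text)     # single non-word characters

# Negative lookahead: "ok" only when it is not followed by "!".
no_bang = re.findall(r"ok(?!!)", "ok! ok?")

print(not_digits, non_digit, non_space, non_word, no_bang)
```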
exercise
- find text between parentheses
- find text between angle brackets
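One possible solution to this exercise relies on a negated class: "an opening bracket, then anything that is not the closing bracket, then the closing bracket". The Python sketch below uses an invented sentence and assumes the brackets are not nested.

```python
import re

text = "Press <Enter> (or <Return>) to continue (optional)."

# \( and \) are literal parentheses; [^)]* is "any characters except )".
in_parens = re.findall(r"\(([^)]*)\)", text)

# Angle brackets are not metacharacters, so they need no escaping.
in_angles = re.findall(r"<([^>]*)>", text)

print(in_parens, in_angles)
```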
alternatives
Exercise after \d and |
- michael, Michael, Mike, mike
- Addresses: digits, then words, then Road or Street or an abbreviation
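A sketch of how the two exercises might be approached with `|` in Python. The address format and the abbreviations `Rd` and `St` are assumptions made for this example; real addresses vary far more.

```python
import re

# Either spelling of the name, with an upper- or lowercase first letter.
names = re.findall(r"[Mm]ichael|[Mm]ike", "michael, Michael, Mike, mike, Mikael")

# Digits, one or more words, then one of several street types.
# (?: ... ) groups the alternatives without capturing them.
address = re.compile(r"\d+ (?:\w+ )+(?:Road|Street|Rd\.?|St\.?)")

candidates = ["12 Abbey Road", "221 Baker St.", "Abbey Road 12"]
checks = [address.fullmatch(c) is not None for c in candidates]

print(names, checks)
```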
group constructs
anchors
metasequences
replacements
lookarounds
exercise
- remove the angle brackets
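This exercise combines a group with a replacement. A minimal Python sketch, assuming the brackets are not nested: the group captures what sits between the brackets, and the replacement puts only that part back.

```python
import re

text = "Click <Save> and then <Close>."

# Group 1 captures the text between the brackets; \1 re-inserts it without them.
without_brackets = re.sub(r"<([^<>]*)>", r"\1", text)

print(without_brackets)
```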
Blocks, example
\p{InArabic}\(
find an opening parenthesis preceded by an Arabic character
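The `\p{InArabic}` syntax is not available everywhere; Python's built-in `re` module, for example, does not support it. The sketch below swaps in a code point range for the basic Arabic block (U+0600 to U+06FF), which is an approximation rather than an exact equivalent in every engine. The sample text is invented.

```python
import re

# The basic Arabic block written as a code point range (stand-in for \p{InArabic}).
ARABIC = "[\u0600-\u06FF]"

# The Arabic word for "book", written with escapes to avoid display-order issues.
text = "\u0643\u062A\u0627\u0628(book) and word (word)"

# Matches the Arabic character together with the parenthesis.
with_char = re.findall(ARABIC + r"\(", text)

# Lookbehind variant: matches only the parenthesis itself.
paren_positions = [m.start() for m in re.finditer("(?<=" + ARABIC + r")\(", text)]

print(with_char, paren_positions)
```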
flags
case sensitive
e.g. A and a are different characters! (different code points)
global
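How these two flags look in practice depends on the tool. In the Python sketch below, case-insensitivity is the `re.IGNORECASE` flag; Python has no "global" flag, so the choice between first match and all matches is made by picking `re.search` or `re.findall`.

```python
import re

text = "Regex, regex, REGEX"

# Case sensitive by default: only the all-lowercase spelling matches.
case_sensitive = re.findall(r"regex", text)

# With the ignore-case flag, all three spellings match.
case_insensitive = re.findall(r"regex", text, re.IGNORECASE)

# "Global" means "find every match, not just the first".
# re.search stops at the first match; re.findall (above) returns all of them.
first_only = re.search(r"regex", text, re.IGNORECASE).group()

print(case_sensitive, case_insensitive, first_only)
```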
Named groups
(?<token>[\d]*)
(?P<year>(?:19|20)\d\d)(?P<delimiter>[- /.])(?P<month>0[1-9]|1[012])\2(?P<day>0[1-9]|[12][0-9]|3[01])
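The date pattern above can be tried out in Python, which uses the `(?P<name>...)` syntax (the `(?<name>...)` form belongs to other engines and is not accepted by Python's `re`). In this sketch `\2` refers back to the second capturing group, the delimiter, so both delimiters must be the same character. The test strings are made up.

```python
import re

DATE = re.compile(
    r"(?P<year>(?:19|20)\d\d)(?P<delimiter>[- /.])"
    r"(?P<month>0[1-9]|1[012])\2(?P<day>0[1-9]|[12][0-9]|3[01])"
)

match = DATE.fullmatch("2024-01-15")
parts = match.groupdict()  # named groups as a dictionary

# Different delimiters: the backreference \2 does not match, so there is no result.
mixed = DATE.fullmatch("2024-01/15")

print(parts, mixed)
```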
Exercises
- [Qq]uestionnaire.*\.docx?$ <- add a list of files to the text sample
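The first exercise pattern can be tested against a list of file names, as in this Python sketch. The file names are hypothetical and only serve to show which ones the pattern accepts.

```python
import re

pattern = re.compile(r"[Qq]uestionnaire.*\.docx?$")

# Hypothetical file names for trying out the pattern.
file_names = [
    "Questionnaire_v2.docx",
    "questionnaire.doc",
    "student questionnaire final.docx",
    "Questionnaire.docx.bak",   # does not end in .doc or .docx
    "questionnaire.pdf",        # wrong extension
    "QUESTIONNAIRE.docx",       # only the first letter may be uppercase
]

# search() looks for the pattern anywhere in the name; $ ties it to the end.
matching = [name for name in file_names if pattern.search(name)]

print(matching)
```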
References
- https://www.regexmagic.com (regex generator)
- http://regular-expressions.com (complete reference)
- https://regexr.com (js regex tester)
- https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf (complete tutorial)

if all that was unfathomable...
here's one last technique:
do not forget to cover examples sent by Valentina in a Word file (same as found in OmegaT's regex page)
Technical
- https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
- https://www.regular-expressions.info/quickstart.html
General public
- https://www.theguardian.com/technology/2012/dec/04/ict-teach-kids-regular-expressions
- https://regexcrossword.com/ <- want to play?
regex training
By msoutopico