Patterns

https://slides.com/djjr/patterns-inf113-f21w22

Patterns

https://slides.com/djjr/patterns

Playlist for Patterns

Notes (Google drive)

Image Gallery

Resources

What to Learn

Read

Chapter in Eloquent JavaScript

Define "pattern"

Discuss & Combine

https://docs.google.com/presentation/d/15zbp_2s1eQB_XuGxAPgVihz1bKETURHgHz90Mmkhyk4/edit?usp=sharing

A pattern is a regularity in the world, in human-made design, or in abstract ideas. As such, the elements of a pattern repeat in a predictable manner. Any of the senses may directly observe patterns.

-Wikipedia

and Text

Bricks,

Tiles,

Fun with Patterns

are

Tiles

Fun

Truchet Tiles

...are square tiles with patterns that are rotationally asymmetric. The plane can be tiled with them in different orientations to form wildly varied patterns.

Escher Tiles

Tiles

Go to Deck

From Creating to Recognizing
PATTERN

Pseudocode and Flowchart Approach to Pattern Recognition

For a different six brick pattern the flow would be the same, but the conditions in each diamond would be different...

function find (pattern, inWhat){ 
  return found (t/f) &  location
}

imagine a function definition

abstract to a black box

pattern

in what?

found?

location

r  R   r  R   R   R   R   R
 R  R   p  P  p  R  p  R   p
R  p r  R  r   R  p  R  p  R
p p  R  p  R   p  R  p  R  p
  R  p  R   p  R  p  R   p  R
R  p  R  p  R  r  R  r  R   r
r  R  p R   r  R  r  R   r  R
R p  R  r  R   p  R  r  R   p
p R   p  R  p  R  p  R  r  R
R  p  R  r  R  p   R  p  R  p
  R  r p R  r   R  p   R  p
 r  p  R  p  R  r  R   p  R

function search
  get the pattern
  get the inWhat text
  call find(pattern, inWhat)
  print results

function find (p,t)
  loop over entire text
    note current start
    //attempt match from here
    loop from start up to length of pattern
      if char no match next quit attempt 
      look at next character
      if length of pattern reached 
        match=TRUE
        note location
     next start of attempted match

i < t.length?

i =0

scan from t[i]

i++

report

start=i

j=0

j < p.length?

t[start+j]==p[j]?

j++

j=p.length+1

j=p.length?

match!

i < t.length?

i =0

scan from t[i]

i++

report

start=i

j=0

j < p.length?

t[start+j]==p[j]?

j++

j=p.length+1

j=p.length?

match!

But also, sometimes, choice, and flexibility.

Patterns are about sequence and repetition.

one black square followed by zero to three pairs of orange squares

STOP+THINK:
What's the pattern?

judgment
lodgment
acknowledgment

queue
blagueur
liqueur

555 432 1234

555-432-1234

(555)432-1234

(555) 432-1234

555.432.1234

What are some words in English that contain the sequence dgm or ueu?

Are these valid phone numbers?

Does your password meet these criteria?

Sometimes we want to use patterns to find things and sometimes we want to use patterns to constrain things.

5554321234

555-432-1234

(555)432-1234

(555) 432-1234

555.432.1234

555 432 1234

WANTED: An easy to use notation to describe any pattern in text

1 Specify SETS of characters

Introduce Notation:

Square brackets around a set to say "one of these"

[0123456789] means "a digit"

Introduce Notation:

A dash means "all the characters in between"

[0-9] means "a digit"

3
digits

2
digits

digit except 0,1

4
digits

2. Specify REPETITION

Introduce Notation:

Number in curly brackets means repeat

[0-9]{3} means "3 digits"

Introduce Notation:

Two numbers in {} means "at least" and "at most".

[0-9]{2,5} means at least 2 and at most 5 digits

3
digits

2
digits

digit except 0,1

4
digits

[2-9]

[0-9]{3,} means at least 3 digits

[0-9]{,5} means up to 5 digits

[2-9]

[0-9]{2}

[0-9]{3}

[0-9]{4}

STOP+THINK

Match

[A-Z]{4}

[A-Z][a-z]{3}

([A-Z][a-z]{,6} ){2,3}

four letter upper case words

four letter words

[a-z]{4}

four letter words with initial cap

full names with each component up to 7 letters and two or three components

STOP+THINK

Dino Valenti

Roy j Sims

([A-Z][a-z]{,6} ){2,3}

Franklin D Rooster

A Malick Obrien

Anna

Dino Valentino

Shorthand for [0-9] is \d

that is \d means "any digit"

Shorthand

"backslash d"

\t matches a tab

\n matches a new line

\r matches carriage return

\f matches a page break

\s matches all of these and SPACE

And...

\s = [ \t \n \r \f]

adding a backslash is called "escaping" because it allows a character to "escape" being interpreted by RegEx as it usually would be.

Specify OPTIONS and ALTERNATIVES

How to match 555 432 1234 OR 5554321234

[2-9][0-9]{2}(\s|-)?[0-9]{3}(\s|-)?[0-9]{4}

[2-9][0-9]{2}\s[0-9]{3}\s[0-9]{4}

STOP+THINK

[2-9][0-9]{2}[0-9]{3}[0-9]{4}

[2-9][0-9]{2}\s?[0-9]{3}\s?[0-9]{4}

notation: y? means 0 or 1 pattern y

notation: (x|y) means pattern x or pattern y

STOP+THINK

How about 555-432-1234 too?

REVIEW CHARACTER SETS

Patterns are defined by sets of characters

[a-z]

[abc]

[0-9a-f]

[A-z]

denotes any character from ASCII 'a' to ASCII 'z'. The hyphen indicates a range.
denotes any of the characters 'a', 'b', or 'c'

denotes characters used for hexadecimal numbers. To be strict use [0-9a-fA-F].A-z] 'A' is hex 41, 'Z' is hex 5A. 'a' is hex k and 'z' is hex l. The characters in between are [ \ ] ^ _ `

Examples

[A-Fa-f]

[^AEIOUaeiou]

[02468]

hex characters

English consonants

Even digits

Summary So Far

[a-z] is a range of characters

? means zero or one of the preceding

{m,n} specifies numbers of repetitions of preceding

\s, \d

3. Specify Boundaries

STOP+THINK

What if I want to only match a pattern at the beginning or end of a word?

Find "abc" in "abcarnia" and "jabc" but not "mabcam"

notation: \b means a word boundary

\babc matches "abcarnia" but not "jabc" and "mabcam"

abc\b matches "jabc" but not "abcarnia" and "mabcam"

[2-9]\d{2}(\s|-)? \d{3}(\s|-)?\d{4}

exactly 2 of these

0 or 1 of these

any digit

space or dash

0 or 1 of these

exactly 4 of these

any digit except 0,1

any digit

space or dash

exactly 3 of these

any digit

Stop + Think

US postal codes (zipcodes) can be either 5 digits or 9 digits with a dash separating the two groups. Which expression describes this pattern?

\d{5}-\d{4}

5 digits required

optional hypen

optional 4 digits

\d{5}(-\d{4})?

\d{5}-?(\d{4})?

\d{9}

Stop + Think

US postal codes (zipcodes) can be either 5 digits or 9 digits with a dash separating the two groups. Which expression describes this pattern?

\d{9}

\d{5}-\d{4}

\d{5}(-\d{4})?

\d{5}-?(\d{4})?

missing dash

could have dash without +4 or +4 without dash

no option for only 5

Stop + Think

Sometimes my ego gets the best of me and I want to search for my name in a document. But sometimes I am referred to by different forms of my name. I want to find Dan Ryan, Daniel Ryan, Danny Ryan, Dan J Ryan, Dan J Ryan Jr, and all the various combinations of these.

Dan(iel|ny)?\sJ?\sRyan\sJr?

Dan(iel|ny)?J?RyanJr?

Dan(iel|ny)?(\sJ)?\sRyan(\sJr)?

Dan(iel|ny)\sJ\sRyan\sJr

Stop + Think

Dan(iel|ny)?\sJ?\sRyan\sJr?

Dan(iel|ny)?J?RyanJr?

Dan(iel|ny)?(\sJ)?\sRyan(\sJr)?

Dan(iel|ny)\sJ\sRyan\sJr

? only applies to "r"

? only applies to "J" so we get 2 spaces

no spaces

missing ?s
no option to omit

Regular Expressions

aka

REGEX

A regular expression is a description of a linear textual pattern using a well-defined syntax

RegEx in JS

Constructing Regular Expressions

regular expression literal

call the constructor function

let re = /ab+c/;

let re = new RegExp('ab+c');

Pascal Case

Anatomy of Regular Expressions

character classes

groups
and
ranges

assertions

abc

\d ... any digit

\D ... any character that is not a digit

\w ... any character that can be in a word

\W ... any character that is not a word character

\s ... any single white space character

\S ... a single character other than white space

quantifiers

simple pattern

flags

x|y ... x or y

[xyz] ... x, y, or z

[x-z] ... character from x to z

[^xyz] ... not one of x, y, z

[^x-z] ... not character from x to z

(x) ... recognize X and remember

x? ... zero or 1 x

x+ ... one or more x

x* ... zero or more x

x{m} ... m x's

x{m,n} ... at least m, at most n, x's

x{,n} ... at most n x's

^ ... start of input

$ ... end of input

\b ... word boundary

\B ... not word boundary

x(?=y) ... x if followed by y

x(?!y) ... x if not followed by y
(?<=y)x ... x preceded by y

(?<!y)x ... x not preceded by y

i ... ignore case

g ... global (all matches)

m ... multiline

escaping

to use a special character literally (actually searching for a "*", for instance), "escape" it by putting a backslash in front of it

JUMP TO USING

FLAGS

i ... ignore case

g ... global (all matches)

m ... multiline

^ ... start of input

$ ... end of input

x? ... zero or 1 x

x+ ... one or more x

x* ... zero or more x

[^xyz] ... not one of x, y, z

[^x-z] ... not character from

upper case shortcut means NOT

Simple Text Hightlighter STUB (https://codepen.io/djjrjr/pen/rNYLmMY)

Insider's Tip: An hour or two on a RegEx Tutorial or two will make you a solid beginner.

https://regexone.com/

A second and third session and a little practice can make you an expert.

But you'll always have to look stuff up.

https://regex101.com/

RegEx stepwise: GROUPS

Patterns are defined in terms of groups

([A-Za-z][0-9][A-Za-z])( +)([0-9][A-Za-z][0-9])

This pattern has three groups

letter-digit-letter
one or more spaces
digit-letter-digit

RegEx stepwise: GROUPS

([A-Za-z][0-9][A-Za-z]) ( +) ([0-9][A-Za-z][0-9])

group 1 group 2 group 3

Quantifiers

RegEx
Anatomy

Sidebar on greedy and lazy operators

const modifiedQuote = 'but [he] ha[s] to go read this novel [Alice in Wonderland].';
const regexpModifications = /\[.*\]/g;       //or \[.*?\]

The pattern has three parts:

\[

.*

\]

Parts 1 and 3 can match only a single character. But part two can "consume" any number of characters, as long as it keeps matching. In this case, the star modifies a period which matches any character. How would this proceed?

Read b and try to match part1 of pattern. Fail.

Read u and try to match part1 of pattern. Fail.

Read t and try to match part1 of pattern. Fail.

Read SP and try to match part1 of pattern. Fail.

Read [ and try to match part1 of pattern. Succeed. Part1 is now exhausted.

Read h and try to match part2 of pattern. Succeed. Part2 still in play.

continues until $ end of input which no longer matches part2 so part2 exhausts and we go to part3

$ is no match for part3 so we back track and examine .

. is no match so we back track to ]

] is a match so Part3 is exhausted

No more parts to pattern so pattern is matched

What is the alternative? Instead of continuing to chomp on the input until it no longer could, part2 of the pattern could "look ahead" and ask whether the next input token would match part3 and if so then it could relinquish control and allow the pattern match to continue with the part3.

How does the ? accomplish this? It switches the * to "lazy mode." In other words, *? is, in a sense, a different quantifier than * by itself. Similarly we have +? and ?? which are lazy versions of + and ?.

See: "Greedy and lazy quantifiers"

Sidebar on greedy and lazy operators

See: "Greedy and lazy quantifiers"

A greedy quantifier continues to consume input until it cannot.

A lazy quantifier yields to the next part of the pattern as soon as it can.

Groups & Ranges

RegEx
Anatomy

Character Classes

RegEx
Anatomy

Assertions

RegEx
Anatomy

Using Regular Expressions

RegExp.exec(String)

RegExp.test(String)

"execute" this regular expression on String. Returns array with match as first element [0]

executes a search for a match between a RegExp in String. Returns true or false.

RegEx
Anatomy

Using Regular Expressions

RegExp.methods()

String.methods()

RegExp.test(string)

String.match(RegExp)

Back to Anatomy

String.search(RegExp)

String.replace(RegExp, Str)

String.split(RegExp)

RegExp.exec(string)

RegEx Use Cases

Give STR the "full treatment" of PATTERN? use RegExp.exec(STR)

Does STR match PATTERN? use RegExp.prototype.test(STR)

Find location of PATTERN in STR? use String.prototype.search(PATTERN)

exec

USING
REGEX

Back to Anatomy

test

USING
REGEX

Back to Anatomy

match

USING
REGEX

Back to Anatomy

search

USING
REGEX

Back to Anatomy

replace

USING
REGEX

Back to Anatomy

split

USING
REGEX

Back to Anatomy

EXTRAS

matchAll()

replaceAll()

HTML TEXT BOX Multiple word changes

Outline

Chapter 4: Applications

Chapter 3: Getting Fancy

Chapter 2: Reverse engineering patterns - regex 101

Chapter 1: Fun with Patterns - Bricks, Tiles, and Why

WHY?

Pattern Generation/Recognition

Last Week: iteration can produce complex patterns
This Week: can iteration recognize complex patterns
Why?
- Science: pattern suggests process
- Intervention: pattern suggests condition
- x: pattern suggests identity
- y: pattern suggests category

Pattern Generation/Recognition

Science: pattern → process

We look for patterns in phenomena - this tends to occur near that, after that, before that - on the assumption that things cause things and that things can be detected directly even if causing cannot.

Pattern Generation/Recognition

Intervention: pattern → condition

Decisions to intervene depend on an assessment of conditions but the condition may be an abstraction based on complex arrays of indicators. Is this a heart attack or just indigestion? Do these diplomatic moves suggest truce or a new offensive? Does this internet chatter suggest a terrorist plot? Does this set of covid data suggest a third wave?

Pattern Generation/Recognition

x: pattern → identity

Unique (or effectively unique) patterns can be used to identify unique entities. DNA matching is the obvious example from TV crime shows, facial recognition from contemporary conversations about AI.

Pattern Generation/Recognition

y: pattern → category

Sometimes things that are not identical are never the less considered to be members of the same category. In natural language processing, for example, we might want to stem words in a text. Or we might want to identify all sentences that have a prepositional phrase in them.

Check In...

Reverse engineering patterns - regex 101

Chapter in Eloquent JavaScript

RegEx by Stepwise Refinement

To invert a set of characters, that is, to match any character except, write a caret (^) inside the first bracket.

[^a-z]

[^0-9]

denotes not a lower case letter

denotes not a digit

RegEx by Stepwise Refinement

Common character sets have shorthand

\d   any digit character 
\D   any character that is not a digit 
\w   any alphanumeric character (“word  
     character”) 
\W   any nonalphanumeric character 
\s   any whitespace character (space,
     tab, etc.) 
\S   any non-whitespace character 
.    any character except for newline*

* Note: $ matches the end of the line, not the newline
        character which comes after the end of the line.

RegEx stepwise: REPETITION

Patterns are defined by repetition

[a-z]?

[a-z]*

[a-z]+

[a-z]{3}

[a-z]{3,5}

zero or one lower case letter

zero or more lower case letters

one or more lower case letters

three lower case letters

three to five lower case letters

Examples: REPETITION

[0-9]{3}[0-9]{3}[0-9]{3}

[0-9]{9}

[0-9]{4}[0-9]{2}[0-9]{3}

3 ways to specify a nine digit number

STOP+THINK

A Canadian social insurance number has 9 digits - three groups of three separated by hyphens. A temporary SIN starts with a nine. Create regex that will recognize:

any SIN
permanent SIN
temporary SIN

Examples: REPETITION

[A-Z]?[a-z]*

[0-9]{3}-[0-9]{3}-[0-9]{3}

9[0-9]{2}-[0-9]{3}-[0-9]{3}

[0-8][0-9]{2}-[0-9]{3}-[0-9]{3}

any word, including ''

Canadian S.I.N.

Temporary S.I.N.

Permanent S.I.N.

Wikipedia, "Social Insurance Number"

Canada. "Social Insurance Number - Overview"

STOP+THINK

Create a regular expression that will recognize different kinds of numbers:

integers
decimals
/-

STOP+THINK

Create a regular expression that will recognize prepositional phrases

RegEx Stepwise: BOUNDARIES

Patterns can be attached to boundaries

Text

Examples

Start of line, end of line

start of word, end of word

RegEx Stepwise: CHOICES

Patterns can contain options

\.(com|ca|edu|org|biz|tv)

asdf

Check In...

Chapter 3: Getting Fancy with JS

RegEx in JS: Basics

In JS a regular expression is a special kind of object
Can be created in two ways
- Specify by a string using RegExp constructor
- Specify as literal between forward slashes

RegEx in JS: References

References TAG=javascript AND regex

What can RegEx Objects Do?

test whether a string contains a match
create an array of matches
do things to each match

RegEx in JS

REGEX.test(string)

returns true if the pattern is found, false if not

RegEx in JS

REGEX.test(string)

Experiment with boundaries:
\b word, start of line ^, end of line $

RegEx in JS

REGEX.exec(string)

return the matching substring and position as "index"

RegEx by Stepwise Refinement

JavaScript

REGEX.exec(string)

exec can be used repeatedly if the "g" flag is on in the regular expression

RegEx by Stepwise Refinement

JavaScript

REGEX.exec(string)

if regex has groups, exec returns matches to each one in the array

SO FAR

How to create a regex object (literally or with "new"

regex.test(STRING) returns true or false

regex.exec(STRING) returns the matching substring(s) and location

Extra : FLAGS

/regex/gm

i : search is case-insensitive: no difference between A and a.
g : search looks for all matches, without it – only the first match is returned.
m : Multiline mode.

Also, but outside our current view:

s : Enables “dotall” mode, that allows a dot . to match newline character \n.
u : Enables full unicode support.
y : “Sticky” mode: searching at the exact position in the text

RegEx in JS

string.match(REGEX)

/g returns array of matches

no g returns matches by groups

Mini Data Cleaner

Check In...

Applications

RegEx in MS Word

sort of

MS Word's "RegEx" = Wildcards

? is used to Represent a single character and * represents any number of characters.!
@ is used to find one or more occurrences of the previous character. For example, lo@t will find lot or loot, ful@ will find ful or full etc.
< and > to mark the start and end of a word, respectively.
Square brackets are always used in pairs and are used to identify specific characters or ranges of characters.
\ is used to "escape" characters that have special meaning in wild cards
[!] finds any character not listed in the box
Curly brackets are used for counting occurrences.
- {n} finds exactly “n” occurrences.
- {n,} finds at least the number “n” occurrences.
- {n,m} finds between “n” and “m” occurrences.

Word Wildcards

RegEx in MS Excel

sort of

RegEx Golf

https://alf.nu/RegexGolf

Example: find prepositional phrases

Check In...

OutTakes

From Scraped Text to Clean Data

records and fields
csv & tsv & fixed column formats

Patterns

Patterns

Playlist for Patterns

Notes (Google drive)

Image Gallery

Resources

What to Learn

Read

Define "pattern"

Discuss & Combine

A pattern is a regularity in the world, in human-made design, or in abstract ideas. As such, the elements of a pattern repeat in a predictable manner. Any of the senses may directly observe patterns.

and Text

Bricks,

Tiles,

Fun with Patterns

are

SEE ALSO

Tiles

Fun

See Also

Truchet Tiles

Escher Tiles

Tiles

From Creating to Recognizing PATTERN

imagine a function definition

abstract to a black box

Patterns are about sequence and repetition.

STOP+THINK: What's the pattern?

Sometimes we want to use patterns to find things and sometimes we want to use patterns to constrain things.

WANTED: An easy to use notation to describe any pattern in text

1 Specify SETS of characters

[0123456789] means "a digit"

[0-9] means "a digit"

2. Specify REPETITION

[0-9]{3} means "3 digits"

[0-9]{2,5} means at least 2 and at most 5 digits

[2-9]

[2-9]

[2-9]

[2-9]

[2-9]

[2-9]

[2-9]

[2-9]

[2-9]

[2-9]

[0-9]{3,} means at least 3 digits

[0-9]{,5} means up to 5 digits

[2-9]

[0-9]{2}

[0-9]{3}

[0-9]{4}

STOP+THINK

STOP+THINK

([A-Z][a-z]{,6} ){2,3}

Shorthand for [0-9] is \d

that is \d means "any digit"

Shorthand

"backslash d"

\t matches a tab

\n matches a new line

\r matches carriage return

\f matches a page break

\s matches all of these and SPACE

And...

\s = [ \t \n \r \f]

Specify OPTIONS and ALTERNATIVES

STOP+THINK

STOP+THINK

REVIEW CHARACTER SETS

Examples

Summary So Far

3. Specify Boundaries

STOP+THINK

Stop + Think

Stop + Think

Stop + Think

Stop + Think

Regular Expressions

REGEX

From Creating to Recognizing
PATTERN

STOP+THINK:
What's the pattern?