Loading deck

Sets, Bags, and Relational Theory

Real-World Interactions

with Data

Ryan B. Harvey

∏ME • January 26, 2017

Ryan B. Harvey

Coder, Datahead, PM, Wonk, Dad.

@CodeAndData • rharvey[at]loyno.edu

http://datascientist.guru

 

Data Engineer

TED Conferences

 

Data Scientist & Software Architect

Kitchology, Inc.

 

Research Associate Pro Bono

Loyola University New Orleans

 

Co-Organizer & Data Wrangler

Code for New Orleans

A quick review of set theory...

Important: Set members are unique.

"The universe"

X=\lbrace0,1,2,3,4,5...\rbrace=\lbrace x |x\in \mathbb{N}\cup\lbrace0\rbrace\rbrace=\lbrace x|x\in\mathbb{N}\vee x\in\lbrace0\rbrace\rbrace
X={0,1,2,3,4,5...}={xxN{0}}={xxNx{0}}X=\lbrace0,1,2,3,4,5...\rbrace=\lbrace x |x\in \mathbb{N}\cup\lbrace0\rbrace\rbrace=\lbrace x|x\in\mathbb{N}\vee x\in\lbrace0\rbrace\rbrace
Y=\lbrace...-5,-4,-3,-2,-1\rbrace=\lbrace -x |x\in \mathbb{N}\cup\lbrace0\rbrace\rbrace=\lbrace -x|x\in\mathbb{N}\vee x\in\lbrace0\rbrace\rbrace
Y={...5,4,3,2,1}={xxN{0}}={xxNx{0}}Y=\lbrace...-5,-4,-3,-2,-1\rbrace=\lbrace -x |x\in \mathbb{N}\cup\lbrace0\rbrace\rbrace=\lbrace -x|x\in\mathbb{N}\vee x\in\lbrace0\rbrace\rbrace
X
XX
Y
YY
X-Y
XYX-Y
Y-X
YXY-X
\lbrace 0\rbrace
{0}\lbrace 0\rbrace
\mathbb{N}
N\mathbb{N}
X\cap Y
XYX\cap Y

Set operations

Special sets

X\cup Y=\mathbb{Z}=\lbrace...-3,-2,-1,0,1,2,3...\rbrace
XY=Z={...3,2,1,0,1,2,3...}X\cup Y=\mathbb{Z}=\lbrace...-3,-2,-1,0,1,2,3...\rbrace
X\cap Y=\lbrace0\rbrace
XY={0}X\cap Y=\lbrace0\rbrace
X\setminus Y=\mathbb{N}
XY=NX\setminus Y=\mathbb{N}
Y\setminus X=\lbrace-y|y\in\mathbb{N}\rbrace
YX={yyN}Y\setminus X=\lbrace-y|y\in\mathbb{N}\rbrace
\text{The null set: }\emptyset=\lbrace\rbrace,\text{ }\#\emptyset=0
The null set: ={}, #=0\text{The null set: }\emptyset=\lbrace\rbrace,\text{ }\#\emptyset=0
\text{Complement of }X\text{: }\overline X=\lbrace x|x\in\mathcal{U}\wedge x\notin X\rbrace=\lbrace x|x\in(\mathcal{U}\setminus X)\rbrace
Complement of X: X={xxUxX}={xx(UX)}\text{Complement of }X\text{: }\overline X=\lbrace x|x\in\mathcal{U}\wedge x\notin X\rbrace=\lbrace x|x\in(\mathcal{U}\setminus X)\rbrace

Elements of a set come from a (possibly infinite) universe, denoted     .

\lbrace1,2,3,4...\rbrace\subset X
{1,2,3,4...}X\lbrace1,2,3,4...\rbrace\subset X
\lbrace0,1,2,3,4\rbrace\subseteq X
{0,1,2,3,4}X\lbrace0,1,2,3,4\rbrace\subseteq X
\mathcal{U}
U\mathcal{U}
\mathcal{U}
U\mathcal{U}

And a quick review of Booleans...

Boolean logic operations

\neg x\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize = \\ \end{matrix}}\;``\text{not }x\text{''}
¬xdef=not x''\neg x\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize = \\ \end{matrix}}\;``\text{not }x\text{''}

Boolean variables and expressions can take two values: True, False

We've been using them implicitly in our definitions of sets, but let's look a bit closer at operations involving Booleans.

x\wedge y\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize = \\ \end{matrix}}\;``x\text{ and }y\text{''}
xydef=x and y''x\wedge y\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize = \\ \end{matrix}}\;``x\text{ and }y\text{''}
x\vee y\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize = \\ \end{matrix}}\;``x\text{ or }y\text{''}
xydef=x or y''x\vee y\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize = \\ \end{matrix}}\;``x\text{ or }y\text{''}
x\oplus y\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize = \\ \end{matrix}}\;``x\text{ exclusive or }y\text{''}
xydef=x exclusive or y''x\oplus y\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize = \\ \end{matrix}}\;``x\text{ exclusive or }y\text{''}
Result
x\oplus y
xyx\oplus y
x\vee y
xyx\vee y
x\wedge y
xyx\wedge y
\neg x
¬x\neg x

True

False

True

False

True

False

True

False

True

False

False

False

True

False

True

False

True

False

True

False

True

False

False

True

True

True

True

False

From sets to bags

Key idea: What happens if we remove the uniqueness constraint on set members?

From sets to bags (multisets)...

Important: Bag members are not necessarily unique.

X=\lbrack0,1,2,3,4,5...\rbrack=\lbrack x |x\in \mathbb{N}\cup\lbrace0\rbrace\rbrack
X=[0,1,2,3,4,5...]=[xxN{0}]X=\lbrack0,1,2,3,4,5...\rbrack=\lbrack x |x\in \mathbb{N}\cup\lbrace0\rbrace\rbrack
Y=\lbrack...-5,-4,-3,-2,-1\rbrack=\lbrack -x |x\in \mathbb{N}\cup\lbrace0\rbrace\rbrack
Y=[...5,4,3,2,1]=[xxN{0}]Y=\lbrack...-5,-4,-3,-2,-1\rbrack=\lbrack -x |x\in \mathbb{N}\cup\lbrace0\rbrace\rbrack

Bag operations

There is no complement of a bag. (What multiplicity do you assign?)

\text{Membership: }x\text{ is in }X\text{ }\Leftrightarrow x\in X
Membership: x is in X xX\text{Membership: }x\text{ is in }X\text{ }\Leftrightarrow x\in X

No longer "membership" -- now we need "multiplicity" to describe.

\text{Multiplicity: }X\text{ has }5\text{ copies of }x\Leftrightarrow\#_{\in X}(x)=5
Multiplicity: X has 5 copies of x#X(x)=5\text{Multiplicity: }X\text{ has }5\text{ copies of }x\Leftrightarrow\#_{\in X}(x)=5

Bag containment is defined in terms of multiplicity.

X\subseteq W\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize \Longleftrightarrow \\ \end{matrix}}\;\forall x\in X, \#_{\in X}(x)\leq\#_{\in W}(x)
XWdefxX,#X(x)#W(x)X\subseteq W\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize \Longleftrightarrow \\ \end{matrix}}\;\forall x\in X, \#_{\in X}(x)\leq\#_{\in W}(x)
X\subset W\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize \Longleftrightarrow \\ \end{matrix}}\;X\subseteq W\wedge X\neq W
XWdefXWXWX\subset W\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize \Longleftrightarrow \\ \end{matrix}}\;X\subseteq W\wedge X\neq W

Bag operations (continued)...

X\cup Y\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize = \\ \end{matrix}}\;\text{ the smallest bag }W\text{ such that }X\subseteq W\wedge Y\subseteq W
XYdef= the smallest bag W such that XWYWX\cup Y\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize = \\ \end{matrix}}\;\text{ the smallest bag }W\text{ such that }X\subseteq W\wedge Y\subseteq W
\text{Multiplicity: }X\text{ has }5\text{ copies of }x\Leftrightarrow\#_{\in X}(x)=5
Multiplicity: X has 5 copies of x#X(x)=5\text{Multiplicity: }X\text{ has }5\text{ copies of }x\Leftrightarrow\#_{\in X}(x)=5

Bag containment is defined in terms of multiplicity.

X\subseteq W\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize \Longleftrightarrow \\ \end{matrix}}\;\forall x\in X, \#_{\in X}(x)\leq\#_{\in W}(x)
XWdefxX,#X(x)#W(x)X\subseteq W\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize \Longleftrightarrow \\ \end{matrix}}\;\forall x\in X, \#_{\in X}(x)\leq\#_{\in W}(x)
X\subset W\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize \Longleftrightarrow \\ \end{matrix}}\;X\subseteq W\wedge X\neq W
XWdefXWXWX\subset W\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize \Longleftrightarrow \\ \end{matrix}}\;X\subseteq W\wedge X\neq W
X\cap Y\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize = \\ \end{matrix}}\;\text{ the largest bag }W\text{ such that }W\subseteq X\wedge W\subseteq Y
XYdef= the largest bag W such that WXWYX\cap Y\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize = \\ \end{matrix}}\;\text{ the largest bag }W\text{ such that }W\subseteq X\wedge W\subseteq Y

"smallest" and "largest" are measured relative to containment ordering

X\setminus Y\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize = \\ \end{matrix}}\; \text{ the bag }W\subseteq X\text{ such that } \forall x\in\mathcal{U},
XYdef= the bag WX such that xU,X\setminus Y\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize = \\ \end{matrix}}\; \text{ the bag }W\subseteq X\text{ such that } \forall x\in\mathcal{U},
\#_{\in W}(x) = \max((\#_{\in X}(x)-\#_{\in Y}(x)), 0)
#W(x)=max((#X(x)#Y(x)),0)\#_{\in W}(x) = \max((\#_{\in X}(x)-\#_{\in Y}(x)), 0)

Bag operations (new ones)...

\text{Multiplicity: }X\text{ has }5\text{ copies of }x\Leftrightarrow\#_{\in X}(x)=5
Multiplicity: X has 5 copies of x#X(x)=5\text{Multiplicity: }X\text{ has }5\text{ copies of }x\Leftrightarrow\#_{\in X}(x)=5
\delta(X)\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize = \\ \end{matrix}}\; \text{ the set }\lbrace x|x\in X\rbrace
δ(X)def= the set {xxX}\delta(X)\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize = \\ \end{matrix}}\; \text{ the set }\lbrace x|x\in X\rbrace

Define the "duplicate elimination function"

X\sqcup Y\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize = \\ \end{matrix}}\; \text{ the bag }W\text{ such that }\forall x\in\mathcal{U},
XYdef= the bag W such that xU,X\sqcup Y\;{\tiny\begin{matrix} _{\scriptsize \text{def}} \\ \normalsize = \\ \end{matrix}}\; \text{ the bag }W\text{ such that }\forall x\in\mathcal{U},

Define the "bag concatenation function"

Note that this function takes a bag and returns a set of the same elements, with each element having multiplicity one.

\#_{\in W}(x)=\#_{\in X}(x)+\#_{\in Y}(x)
#W(x)=#X(x)+#Y(x)\#_{\in W}(x)=\#_{\in X}(x)+\#_{\in Y}(x)

Whereas bag union results in a bag with max multiplicity for each element among the two operand bags, bag concatenation results in each element having sum multiplicity

From bags to relations

Key idea: What happens if we have more complex bag members than just numbers?

If that wasn't enough... relations!

\lbrace A_1,A_2,...,A_n\rbrace
{A1,A2,...,An}\lbrace A_1,A_2,...,A_n\rbrace

A relation scheme       is a finite set of attribute names

                                .

Corresponding to each attribute name        is a set        called the

domain of       , which we can denote                               .

Domains are arbitrary, non-empty sets, which may be finite or

countably infinite.

Define                                                    , the superset of all domains of attribute names in     .

 

A relation      on relation scheme      is a finite set of mappings

                                                      such that for each

                    .  The mappings       are called tuples.

Definitions

R
RR
A_i
AiA_i
D_i
DiD_i
A_i
AiA_i
D_i=\text{dom}(A_i)
Di=dom(Ai)D_i=\text{dom}(A_i)
\mathbb{D}=D_1\cup D_2\cup\cdots\cup D_n
D=D1D2Dn\mathbb{D}=D_1\cup D_2\cup\cdots\cup D_n
R
RR
R
RR
r
rr
\lbrace t_1, t_2, \cdots, t_p\rbrace, t_k:R\rightarrow \mathbb{D}
{t1,t2,,tp},tk:RD\lbrace t_1, t_2, \cdots, t_p\rbrace, t_k:R\rightarrow \mathbb{D}
t_k\in r, t_k(A_i)\in D_i,
tkr,tk(Ai)Di,t_k\in r, t_k(A_i)\in D_i,
1\leq i\leq n
1in1\leq i\leq n
t_k
tkt_k

Um, whut?

Whut?

ADDRESS STATUS DISTRICT CATEGORY DESCRIPTION
13634 Dwyer Closed E Street Light Pole #IBI35
Glouster & Norgate Open E Sidewalk Repair Accessibility ramp cracked
7203 Benjamin Closed A Street Light Pole #BP45
2900 Dublin Closed A Pothole Travel lane
2521 Marais Closed C Sidewalk Repair Cracked
1113 Mandeville Closed C Abandoned Vehicle White Cadillac
1914 Foucher Open B Garbage Pickup Start trash service

Excerpts from the City of New Orleans' 311 service calls data set, available at data.nola.gov

Whut?

ADDRESS STATUS DISTRICT CATEGORY DESCRIPTION
13634 Dwyer Closed E Street Light Pole #IBI35
Glouster & Norgate Open E Sidewalk Repair Accessibility ramp cracked
7203 Benjamin Closed A Street Light Pole #BP45
2900 Dublin Closed A Pothole Travel lane
2521 Marais Closed C Sidewalk Repair Cracked
1113 Mandeville Closed C Abandoned Vehicle White Cadillac
1914 Foucher Open B Garbage Pickup Start trash service
\text{relation scheme }R=\text{CALLS}=\{A_i\}_{i=1}^5
relation scheme R=CALLS={Ai}i=15\text{relation scheme }R=\text{CALLS}=\{A_i\}_{i=1}^5
(n=5)
(n=5)(n=5)

Whut?

ADDRESS STATUS DISTRICT CATEGORY DESCRIPTION
13634 Dwyer Closed E Street Light Pole #IBI35
Glouster & Norgate Open E Sidewalk Repair Accessibility ramp cracked
7203 Benjamin Closed A Street Light Pole #BP45
2900 Dublin Closed A Pothole Travel lane
2521 Marais Closed C Sidewalk Repair Cracked
1113 Mandeville Closed C Abandoned Vehicle White Cadillac
1914 Foucher Open B Garbage Pickup Start trash service
\text{relation }r=\text{status(CALLS)}=\{t_k\}_{k=1}^7
relation r=status(CALLS)={tk}k=17\text{relation }r=\text{status(CALLS)}=\{t_k\}_{k=1}^7
(p=7)
(p=7)(p=7)

Whut?

ADDRESS STATUS DISTRICT CATEGORY DESCRIPTION
13634 Dwyer Closed E Street Light Pole #IBI35
Glouster & Norgate Open E Sidewalk Repair Accessibility ramp cracked
7203 Benjamin Closed A Street Light Pole #BP45
2900 Dublin Closed A Pothole Travel lane
2521 Marais Closed C Sidewalk Repair Cracked
1113 Mandeville Closed C Abandoned Vehicle White Cadillac
1914 Foucher Open B Garbage Pickup Start trash service
\text{tuple }t_5=(``\text{2521 Marais"}, ``\text{Closed"}, ``\text{C"}, ``\text{Sidewalk Repair"}, ``\text{Cracked"})
tuple t5=(2521 Marais",Closed",C",Sidewalk Repair",Cracked")\text{tuple }t_5=(``\text{2521 Marais"}, ``\text{Closed"}, ``\text{C"}, ``\text{Sidewalk Repair"}, ``\text{Cracked"})

Whut?

ADDRESS STATUS DISTRICT CATEGORY DESCRIPTION
13634 Dwyer Closed E Street Light Pole #IBI35
Glouster & Norgate Open E Sidewalk Repair Accessibility ramp cracked
7203 Benjamin Closed A Street Light Pole #BP45
2900 Dublin Closed A Pothole Travel lane
2521 Marais Closed C Sidewalk Repair Cracked
1113 Mandeville Closed C Abandoned Vehicle White Cadillac
1914 Foucher Open B Garbage Pickup Start trash service
A_3=``\text{DISTRICT"}, \text{dom}(``\text{DISTRICT"})=\{x|x\text{ is a one-letter city council district code}\}
A3=DISTRICT",dom(DISTRICT")={xx is a one-letter city council district code}A_3=``\text{DISTRICT"}, \text{dom}(``\text{DISTRICT"})=\{x|x\text{ is a one-letter city council district code}\}

Whut?

ADDRESS STATUS DISTRICT CATEGORY DESCRIPTION
13634 Dwyer Closed E Street Light Pole #IBI35
Glouster & Norgate Open E Sidewalk Repair Accessibility ramp cracked
7203 Benjamin Closed A Street Light Pole #BP45
2900 Dublin Closed A Pothole Travel lane
2521 Marais Closed C Sidewalk Repair Cracked
1113 Mandeville Closed C Abandoned Vehicle White Cadillac
1914 Foucher Open B Garbage Pickup Start trash service
A_2=``\text{STATUS"}, \text{dom}(``\text{STATUS"})=\{``\text{Open"},``\text{Closed"}\}
A2=STATUS",dom(STATUS")={Open",Closed"}A_2=``\text{STATUS"}, \text{dom}(``\text{STATUS"})=\{``\text{Open"},``\text{Closed"}\}

Relational algebra

Key idea: Can define an algebra similar to set and bag operations on relations?

An algebra on relations

For a relation scheme    , if            and           are both relations on    ,

consider    and    as sets over the same universe    .  Then we can

apply the boolean operations as follows.

Boolean Operators

A_i
AiA_i
R
RR
R
RR
\text{Let dom}(R)=\{\text{all tuples }t\text{ over all }A_i\in R\text{ and their domains}\}
Let dom(R)={all tuples t over all AiR and their domains}\text{Let dom}(R)=\{\text{all tuples }t\text{ over all }A_i\in R\text{ and their domains}\}
r\cap s = \text{relation on }R\text{ containing tuples }t\text{ such that }t\in r\text{ and }t\in s
rs=relation on R containing tuples t such that tr and tsr\cap s = \text{relation on }R\text{ containing tuples }t\text{ such that }t\in r\text{ and }t\in s
r(R)
r(R)r(R)
s(R)
s(R)s(R)
s
ss
r
rr
\mathcal{U}
U\mathcal{U}
r\cup s = \text{relation on }R\text{ containing tuples }t\text{ such that }t\in r\text{ or }t\in s
rs=relation on R containing tuples t such that tr or tsr\cup s = \text{relation on }R\text{ containing tuples }t\text{ such that }t\in r\text{ or }t\in s
r\setminus s = \text{relation on }R\text{ containing tuples }t\text{ such that }t\in r\text{ and }t\notin s
rs=relation on R containing tuples t such that tr and tsr\setminus s = \text{relation on }R\text{ containing tuples }t\text{ such that }t\in r\text{ and }t\notin s
r\cap s=r\setminus(r\setminus s)
rs=r(rs)r\cap s=r\setminus(r\setminus s)

Active Domain

Note that: 

Define active domain of       relative to relation    as

r
rr
\text{adom}(A_i,r)=\{d\in D_i|\exists t\in r\text{ with }t(A_i)=d\}
adom(Ai,r)={dDitr with t(Ai)=d}\text{adom}(A_i,r)=\{d\in D_i|\exists t\in r\text{ with }t(A_i)=d\}

An algebra on relations

Using the domain of scheme      to compute the complement    

 

of            is not closed on relations.  If any attribute 

has       such that                       then                   , and    is not a relation.

Complement of a Relation

R
RR
\#\bar r=\infty
#r¯=\#\bar r=\infty
r(R)
r(R)r(R)
A_i\in R
AiRA_i\in R
\bar r=\text{dom}(R)\setminus r
r¯=dom(R)r\bar r=\text{dom}(R)\setminus r
\bar r
r¯\bar r
\#D_i=\infty
#Di=\#D_i=\infty
D_i
DiD_i

Thus, using the active domain, we define the active complement

 

which is finite, and always results in a relation on    .

R
RR
\tilde r=\text{adom}(R,r)\setminus r
r~=adom(R,r)r\tilde r=\text{adom}(R,r)\setminus r

The set of all relations on a given scheme     is closed under

union, intersection, set difference, and active complement.

R
RR

An algebra on relations

Select is a unary operator on relations yeilding another relation that

is the subset of tuples with a given value on a specified attribute.

The "select" operator

\sigma_{A=a}(r)=r'(R)=\{t\in r|t(A)=a\}
σA=a(r)=r(R)={trt(A)=a}\sigma_{A=a}(r)=r'(R)=\{t\in r|t(A)=a\}
\sigma_{A=a}(r)
σA=a(r)\sigma_{A=a}(r)
a
aa

With           a relation on scheme    ,     an attribute name in    , and

                         ,

 

                 is read "select     equal to     on   ."

a\in\text{dom}(A)
adom(A)a\in\text{dom}(A)
\sigma
σ\sigma
A
AA
r
rr

Select filters a relation's tuples leaving only those with the given

attribute value.  Recalling the data table example, this is filtering

rows of the table by the value in a particular column.

A
AA
r(R)
r(R)r(R)
R
RR
R
RR

An algebra on relations

The "select" operator       (properties)

    is also distributive over Boolean operators              .  For example,

for relations            and          , and assuming uniqueness of tuples,

\sigma
σ\sigma
\sigma
σ\sigma
\sigma_{A=a}(r\cap s)=\sigma_{A=a}(\{t|t\in r\wedge t\in s\})
σA=a(rs)=σA=a({ttrts})\sigma_{A=a}(r\cap s)=\sigma_{A=a}(\{t|t\in r\wedge t\in s\})
\cup,\cap,\setminus
,,\cup,\cap,\setminus

    is commutative under composition, so that

 

and we can instead write                         , with order of the attribute

equivalences unimportant.

\sigma
σ\sigma
\sigma_{A=a}(\sigma_{B=b}(r))=\sigma_{A=a}\circ\sigma_{B=b}=\sigma_{B=b}\circ\sigma_{A=a}=\sigma_{B=b}(\sigma_{A=a}(r))
σA=a(σB=b(r))=σA=aσB=b=σB=bσA=a=σB=b(σA=a(r))\sigma_{A=a}(\sigma_{B=b}(r))=\sigma_{A=a}\circ\sigma_{B=b}=\sigma_{B=b}\circ\sigma_{A=a}=\sigma_{B=b}(\sigma_{A=a}(r))
\sigma_{A=a,B=b}(r)
σA=a,B=b(r)\sigma_{A=a,B=b}(r)
r(R)
r(R)r(R)
s(R)
s(R)s(R)
=\{t'\in\{t|t\in r\wedge t\in s\}|t'(A)=a\}
={t{ttrts}t(A)=a}=\{t'\in\{t|t\in r\wedge t\in s\}|t'(A)=a\}
=\{t|t\in r\wedge t(A)=a\}\cap\{t|t\in s\wedge t(A)=a\}
={ttrt(A)=a}{ttst(A)=a}=\{t|t\in r\wedge t(A)=a\}\cap\{t|t\in s\wedge t(A)=a\}
=\sigma_{A=a}(\{t|t\in r\})\cap\sigma_{A=a}(\{t|t\in s\})
=σA=a({ttr})σA=a({tts})=\sigma_{A=a}(\{t|t\in r\})\cap\sigma_{A=a}(\{t|t\in s\})
=\sigma_{A=a}(r)\cap\sigma_{A=a}(s)
=σA=a(r)σA=a(s)=\sigma_{A=a}(r)\cap\sigma_{A=a}(s)

An algebra on relations

Project is a unary operator on relations yeilding another relation

having a relational scheme with the specified attributes, and

containing tuples corresponding to each tuple in the operand.

The "project" operator

\pi_X(r)=r'(X)=\{t(X)|t\in r(R)\}
πX(r)=r(X)={t(X)tr(R)}\pi_X(r)=r'(X)=\{t(X)|t\in r(R)\}
\pi_X(r)
πX(r)\pi_X(r)

With           a relation on scheme    , and               ,

 

             is read "the projection of     onto     ."

X\in R
XRX\in R
\pi
π\pi
X
XX
r
rr

Project filter's a relational scheme's attribute list.  Recalling the data table example, this is taking specified columns of the table only.  If there is an assumption of uniqueness of tuples, we would then apply the duplicate removal function.

r(R)
r(R)r(R)
R
RR

An algebra on relations

The "project" operator       (properties)

\pi
π\pi

Under composition, order does matter -- latter applications of  

subsume former, if the latter application is proper.

If                          , then                                                                     .

Similarly, if                                                                   , then

                                                                                                          ,

so that only outermost applications of the project operator need

be considered for evaluation.

 

Project commutes with select when the attribute(s) for selection

are among those in the set onto which the projection is taking place.

\pi
π\pi
X\subseteq Y\subseteq R
XYRX\subseteq Y\subseteq R
\pi_X\circ\pi_Y(r)=\pi_X(\pi_Y(r))=\pi_X(r)
πXπY(r)=πX(πY(r))=πX(r)\pi_X\circ\pi_Y(r)=\pi_X(\pi_Y(r))=\pi_X(r)
X_1\subseteq X_2\subseteq X_3\subseteq\cdots\subseteq X_n\subseteq R
X1X2X3XnRX_1\subseteq X_2\subseteq X_3\subseteq\cdots\subseteq X_n\subseteq R
\pi_{X_1}\circ\pi_{X_2}(r)\circ\pi_{X_2}(r)\cdots\pi_{X_n}(r)=\pi_{X_1}(r)
πX1πX2(r)πX2(r)πXn(r)=πX1(r)\pi_{X_1}\circ\pi_{X_2}(r)\circ\pi_{X_2}(r)\cdots\pi_{X_n}(r)=\pi_{X_1}(r)
\text{For }A\in X\subseteq R\text{ and }r(R)\text{ a relation, }\pi_X(\sigma_{A=a}(r))=\sigma_{A=a}(\pi_X(r))
For AXR and r(R) a relation, πX(σA=a(r))=σA=a(πX(r))\text{For }A\in X\subseteq R\text{ and }r(R)\text{ a relation, }\pi_X(\sigma_{A=a}(r))=\sigma_{A=a}(\pi_X(r))

An algebra on relations

Join is a binary operator on relations yielding another relation

with a schema having a combination of attribute names, and 

containing tuples corresponding to the tuples in each operand.

The "join" operator

r\bowtie s
rsr\bowtie s
q(RS)
q(RS)q(RS)

Given relations            and          ,              ("the join of    and    ") is the

relation               of all tuples    over        such that there are tuples 

                            with                                         .

RS
RSRS
\bowtie
\bowtie
t
tt
r
rr

If                      , then              is the Cartesian product of    and   .

r(R)
r(R)r(R)
t_r=t(R), t_s=t(S)
tr=t(R),ts=t(S)t_r=t(R), t_s=t(S)
s(S)
s(S)s(S)
s
ss
t_r\in r, t_s\in s
trr,tsst_r\in r, t_s\in s
t_r(R\cap S)=t_s(R\cap S)
tr(RS)=ts(RS)t_r(R\cap S)=t_s(R\cap S)

Since               is a  subset of both      and     , then 

R
RR
r
rr
R\cap S
RSR\cap S
s
ss
S
SS
r\bowtie s
rsr\bowtie s
R\cap S=\emptyset
RS=R\cap S=\emptyset

An algebra on relations

Let's review...

Operation Operator Function
Select filter rows/tuples
Project filter columns/attributes
Join combine relations
Booleans create filter conditions
Rename (new!) re-label an attribute
\sigma
σ\sigma
\pi
π\pi
\bowtie
\bowtie
\cap,\cup,\setminus
,,\cap,\cup,\setminus
\rho
ρ\rho

Further reading...

There's much more, and there's relational calculus too.

Maier, David.  The Theory of Relational Databases.  Computer Science Press, 1983.

 

Elmari, Ramez and Shamkant B. Navathe. Fundamentals of Database Systems. The Benjamin/Cummings Publishing Company, 1989.

 

Albert, Joseph. Algebraic Properties of Bag Data Types. Proceedings of the 17th International Conference on Very Large Data Bases. Barcelona, 1991.

 

Pagh, Ramus.  Lecture notes on Relational algebra and SQL. Introduction to Databases, IT University of Copenhagen, 2005. http://www.itu.dk/people/pagh/IDB05/Algebra.pdf

To the database!

Key idea: How is this algebra useful in real-world interactions with data?

Now for some SQL

Let's augment our relational operations table...

Operation Operator Function SQL clause
Select filter rows WHERE, HAVING
Project filter columns SELECT col list
Join combine relations JOIN
Booleans filter conditions AND, OR, NOT IN
Rename (new!) re-label attributes as
Dup removal remove duplicates DISTINCT
\sigma
σ\sigma
\pi
π\pi
\bowtie
\bowtie
\cap,\cup,\setminus
,,\cap,\cup,\setminus
\rho
ρ\rho
\delta
δ\delta

Now for some SQL

Some practical things to help us out...

We can add constraints to columns in tables to specify the domain of those attributes.

 

If we don't want to deal with the troubles duplicates cause, we can ensure uniqueness by adding a "key" constraint to one or more attributes/columns.

 

We can also add constraints to specify that each tuple in the relation must have a non-empty value for a particular attribute or set of attributes.

Now for some SQL

Let's try writing an expression

Say we want to find all the tuples in the status(CALLS) relation with STATUS "Open" and DISTRICT "A", and find only the CATEGORY.

 

The relational algebra expression for this looks like

 

 

If we were to use SQL to rewrite this operation on the database table, it looks like

          SELECT CATEGORY

          FROM CALLS

          WHERE STATUS='Open' AND DISTRICT='A'

\pi_{\text{CATEGORY}}\circ\sigma_{\text{STATUS}=``\text{Open"},\text{DISTRICT}=``\text{A"}}(\text{status}(\text{CALLS}))
πCATEGORYσSTATUS=Open",DISTRICT=A"(status(CALLS))\pi_{\text{CATEGORY}}\circ\sigma_{\text{STATUS}=``\text{Open"},\text{DISTRICT}=``\text{A"}}(\text{status}(\text{CALLS}))

Demo