1st Meeting of Postdocs at IME-USP

Victor Sanches Portella

April, 2025

ime.usp.br/~victorsp

The Mathematics of Online and Private Learning

Who am I?

Postdoc supervised by prof. Yoshiharu Kohayakawa

ML Theory

Optimization

Randomized Algorithms

My interests according to a student:

"Crazy Algorithms"

Online Learning

Prediction with Experts' Advice

Player

Adversary

\(n\) Experts

Probabilities \(p_t\) (e.g., 0.5, 0.1, 0.3, 0.1)

Costs \(\ell_t \in [-1,1]^n\) (e.g., 1, -1, 0.5, -0.3)

Player's loss: \(\langle \ell_t, p_t \rangle = \mathbb{E}_{i \sim p_t}[\ell_t(i)]\)

Adversary knows the strategy of the player

Measuring the Player's Performance

Attempt #1: Total player's loss
\displaystyle \sum_{t = 1}^T \langle \ell_t, p_t \rangle
Can be \(= T\) always

Attempt #2: Compare with the offline optimum
\displaystyle \sum_{t = 1}^T \langle \ell_t, p_t \rangle - \sum_{t = 1}^T \min_{i = 1, \dotsc, n} \ell_t(i)
Almost the same as Attempt #1

Attempt #3: Restrict the offline optimum
\displaystyle \mathrm{Regret}(T) = \sum_{t = 1}^T \langle \ell_t, p_t \rangle - \min_{i = 1, \dotsc, n} \sum_{t = 1}^T \ell_t(i)
Player's Loss \(-\) Loss of Best Expert

Goal:
\displaystyle \frac{\mathrm{Regret}(T)}{T} \to 0

Example: Cumulative Loss of the Experts

           Expert 1   Expert 2   Expert 3   Expert 4
t = 1         0          1         0.5         1
t = 2         1         1.5        0.5         1
t = 3        1.5         2          1         1.5
t = 4        2.5         3          2         1.5
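
To make the protocol and the regret computation concrete, here is a minimal Python sketch (my own illustration; the uniform player and the random costs are placeholders, not an algorithm from this talk):

import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 100
p = np.full(n, 1.0 / n)            # player's probabilities p_t (uniform placeholder strategy)
player_loss = 0.0
cum_expert_loss = np.zeros(n)      # cumulative losses L_T(i) of each expert

for t in range(T):
    loss = rng.uniform(-1.0, 1.0, size=n)    # adversary's costs ell_t in [-1, 1]^n
    player_loss += loss @ p                  # player suffers <ell_t, p_t>
    cum_expert_loss += loss

regret = player_loss - cum_expert_loss.min()  # Regret(T): player vs. best single expert
print(f"player = {player_loss:.2f}, best expert = {cum_expert_loss.min():.2f}, regret = {regret:.2f}")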

Why Learning with Experts?

Boosting in ML

Understanding sequential prediction & online learning

Universal Optimization

TCS, Learning theory, SDPs...

Some Work of Mine

Algorithm design guided by PDEs and (Stochastic) Calculus tools

Modeling Online Learning in Continuous Time

Analysis often becomes clean

Sandbox for design of optimization algorithms

Gradient flow is useful for smooth optimization

\displaystyle \partial_t x_t = - \nabla f(x_t)
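
As a numerical illustration (my own sketch; the quadratic \(f\) is a placeholder), an explicit Euler discretization of this flow with step size \(\mathrm{d}t\) is exactly gradient descent:

import numpy as np

def grad_f(x):
    return x          # gradient of the placeholder objective f(x) = ||x||^2 / 2

x = np.array([3.0, -2.0])
dt = 0.01             # Euler step size, i.e., the discretization of continuous time
for _ in range(1000):
    x = x - dt * grad_f(x)    # x_{t + dt} = x_t - dt * grad f(x_t)
print(x)              # approaches the minimizer 0 of f as t grows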

Key Question: How to model non-smooth (online) optimization in continuous time?

Why go to continuous time?

Modeling Adversarial Costs in Continuous Time

Total loss of expert \(i\):
\displaystyle L_t(i) = \sum_{s = 1}^t \ell_s(i)

Useful perspective:

Discrete Time: \(L(i)\) is a realization of a random walk

Continuous Time: \(L(i)\) is a realization of a Brownian Motion,
\displaystyle L_t(i) = B_t(i)

Probability 1 = Worst-case
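
A small simulation sketch (my own illustration, assuming random \(\pm 1\) costs) of this perspective: the cumulative loss of a single expert is a random walk, and rescaling it gives a Brownian-motion-like path.

import numpy as np

rng = np.random.default_rng(1)
T = 10_000
ell = rng.choice([-1.0, 1.0], size=T)   # costs ell_t(i) of one expert
L = np.cumsum(ell)                      # discrete time: L_t(i) = ell_1(i) + ... + ell_t(i)

# Donsker-type rescaling: L_{floor(T u)} / sqrt(T) behaves like a Brownian motion B_u on [0, 1]
print(L[-1] / np.sqrt(T))               # approximately N(0, 1) for large T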

Differential Privacy

What do we mean by "privacy" in this case?

Informal Goal: Output should not reveal (too much) about any single individual

Not considering protections against security breaches

Data Analysis -> Output

Output should have information about the population

This has more to do with "confidentiality" than "privacy"

Real-life example - Netflix Dataset

Differential Privacy

\(\mathcal{M}\) run on two datasets -> Output 1, Output 2: Indistinguishable

Differential Privacy

Anything learned with an individual in the dataset

can (likely) be learned without

\(\mathcal{M}\) needs to be randomized to satisfy DP

Adversary with full information of all but one individual can infer membership

Differential Privacy (Formally)

Definition: \(\mathcal{M}\) is \((\varepsilon, \delta)\)-Differentially Private if, for any pair of neighboring datasets \(X, X'\) (they differ in one entry) and for all \(S\),

\mathbb{P}(\mathcal{M}(X) \in S) \leq e^{\varepsilon} \cdot \mathbb{P}(\mathcal{M}(X') \in S) + \delta

\(\varepsilon \equiv \)  "Privacy leakage", in theory constant \(\leq 1\)

\(\delta \equiv \)  "Chance of failure", usually VERY small

An Example: Computing the Mean

Goal: an \((\varepsilon, \delta)\)-DP \(\mathcal{M}\) that approximates the mean of the \(x_i\)'s, where \(X = (x_1, \dotsc, x_n)\) with \(x_i \in [-1,1]\)

Algorithm:
\displaystyle \mathcal{M}(X) = \mathrm{Mean}(X) + Z
where \(Z\) is Gaussian or Laplace noise
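
A minimal sketch of this mechanism with Laplace noise (pure \(\varepsilon\)-DP, i.e. \(\delta = 0\); the noise scale is the usual calibration to the sensitivity \(2/n\) of the mean of values in \([-1,1]\)):

import numpy as np

def private_mean(x, eps, rng):
    # Laplace mechanism: Mean(X) + Z with Z ~ Lap(sensitivity / eps).
    # Each x_i lies in [-1, 1], so changing one entry moves the mean by at most 2/n.
    n = len(x)
    scale = (2.0 / n) / eps
    return np.mean(x) + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1000)
print(np.mean(x), private_mean(x, eps=0.5, rng=rng))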

Some of my Work - Covariance Estimation

\(x_1, x_2, \dotsc, x_n \sim \mathcal{N}(0, \Sigma)\) on \(\mathbb{R}^d\), with \(\Sigma \succ 0\) an unknown covariance matrix; data \(X \in \mathbb{R}^{d \times n}\)

There are \((\varepsilon, \delta)\)-differentially private \(\mathcal{M}\) to estimate \(\Sigma\)

ARE THEY OPTIMAL?

Previous results: YES... under some artificial restrictions

Our results: YES!

A Lower Bound Strategy

Assume \(\mathcal{M}\) is accurate
=> There is high correlation between output and input

Feed to \(\mathcal{M}\) a marked input \(X\)

\((\varepsilon,\delta)\)-DP implies correlation is bounded

Opposing Conditions

Tools: Stein's Lemma (!) and Stokes' Theorem (!?)

Online meets Private

Online and Private Learnability are Equivalent

PAC Learning: how many examples \((x_1, h^*(x_1)), (x_2, h^*(x_2)), \dotsc\) to "learn" \(h^* \colon X \to \{0,1\}\) from a set of hypotheses \(\mathcal{H}\)?

Bounded in terms of the VC Dimension of \(\mathcal{H}\)


Private PAC Learning: the same question, but the learner must satisfy differential privacy

Bounded in terms of the Littlestone Dimension of \(\mathcal{H}\)

The Littlestone Dimension characterizes ONLINE LEARNABILITY

Tighter bounds?

Algorithmic implications?

Circumvent OL?

Online Learning, Privately

We need to release information at every round

But changes are usually incremental

Can we do better than naive algorithms?

Some Applications

ML Training

Synthetic Data Generation

Boosting

How can we make Online Learning private?

Online Algorithms

Final Remarks

Differential Privacy is a formal definition of private computation

Online Learning is a powerful learning theory framework

Both are connected?!

Thanks!

Many interesting questions

Further investigate the connections between OL and DP

Find new algorithms and limits for Online DP

DP and Other Areas of ML and TCS

Online Learning

Adaptive Data Analysis and Generalization in ML

Robust statistics

Proof uses Ramsey Theory :)

Backup Slides

Takeaways from the Examples

Privacy is quite delicate to get right

Hard to take into account side information

"Anonymization" is hard to define and implement properly

Different use cases require different levels of protection

Real-life example - NY Taxi Dataset

Summary: License plates were anonymized using MD5

Easy to de-anonymize due to license plate structure

By Vijay Pandurangan
https://www.vijayp.ca/articles/blog/2014-06-21_on-taxis-and-rainbows--f6bc289679a1.html

An Example: Computing the Mean

Goal: an \((\varepsilon, \delta)\)-DP \(\mathcal{M}\) that approximates the mean:
\mathbb{E}\Big[\lVert \mathcal{M}(X) - \mathrm{Mean}(X)\rVert \Big]
is small

Algorithm:
\displaystyle \mathcal{M}(X) = \mathrm{Mean}(X) + Z
where \(Z\) is Gaussian or Laplace noise, \(X = (x_1, \dotsc, x_n)\), \(x_i \in [-1,1]^d\)

OPTIMAL?

Theorem

For \(Z \sim \mathcal{N}(0, \sigma^2 I)\) with
\sigma \approx \frac{d}{n} \frac{\sqrt{\ln(1/\delta)}}{\varepsilon}
\(\mathcal{M}\) is \((\varepsilon, \delta)\)-DP and
\mathbb{E}\Big[\lVert \mathcal{M}(X) - \mathrm{Mean}(X)\rVert_2\Big] \leq \sigma \approx \frac{d}{n} \frac{\sqrt{\ln(1/\delta)}}{\varepsilon}

The Advantages of Differential Privacy

Worst case: No assumptions on the adversary

Immune to post-processing: Any computation on the output cannot degrade the privacy guarantees

Composable: DP guarantees of different algorithms compose nicely, even if done in sequence and adaptively

Online Algorithms in General

Online Algorithms:

data is processed one piece at a time

Online Learning

Streaming

???

Fingerprinting Codes

Avoiding Pirated Movies via Fingerprinting

Movie may leak!

Movie Owner

Can we detect one of them?

Idea: Mark some of the scenes (Fingerprinting)

Fingerprinting Codes

\begin{pmatrix} 1 & 1 & 0 & \cdots & 1 & 0 \\ 0 & 1 & 1 & \cdots & 0 & 0 \\ 0 & 1 & 1 & \cdots & 0 & 1 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 1 & 1 & 0 & \cdots & 1 & 0 \end{pmatrix}

\(d\) scenes

\(n\) copies of the movie

1 = marked scene

0 = unmarked scene

Code usually randomized

We can do this with \(d = 2^n\). Can \(d\) be smaller?

Example of pirating: colluders holding the copies
\begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}
can produce any copy of the form
\begin{pmatrix} \text{Only } 1 \\ 0 \text{ or } 1 \\ 0 \text{ or } 1 \end{pmatrix}
(scenes on which all their copies agree must be kept; the rest can be either)

Goal of fingerprinting

Given a copy of the movie, trace back one of the colluders, with probability of false positive \(o(1/n)\)

Fingerprinting Codes for Lower Bounds

Assume \(\mathcal{M}\) is accurate

Adversary can detect some \(x_i\)
with high probability

Feed to \(\mathcal{M}\) a marked input \(X\)

\((\varepsilon,\delta)\)-DP implies adversary detects \(x_i\) on \(\mathcal{M}(X')\) with
\(X' = X - \{x_i\} + \{z\}\)

CONTRADICTION

FP codes with \(d = \tilde{O}(n^2)\)

Output -> Pirated Movie

Breaks False Positive Guarantee

[Tardos '08]

The Good, The Bad, and The Ugly of Codes

The Good: Leads to optimal lower bounds for a variety of problems

The Bad: Very restricted to binary inputs

The Ugly: Black-box use of FP codes makes it hard to adapt them to other settings

Fingerprinting Lemmas

Idea: For some distribution on the input, the output is highly correlated with the input

Lemma (A 1D Fingerprinting Lemma, [Bun, Steinke, Ullman '16])

\(\mathcal{M} \colon [-1,1]^n \to [-1,1]\)

\(p \sim \mathrm{Unif}([-1,1])\)

\(x_1, \dotsc, x_n \in \{\pm 1\}\) random such that \(\mathbb{E}[x_i] = p\)

\displaystyle \mathbb{E}\Big [\sum_{i = 1}^n (\mathcal{M}(X) - p) \cdot (x_i - p) \Big ] \geq \frac{1}{3} - \mathbb{E}\big[(\mathcal{M}(X) - p)^2\big]

"Correlation" between \(x_i\) and \(\mathcal{M}(X)\): \(\mathcal{A}(x_i, \mathcal{M}(X)) = (\mathcal{M}(X) - p) \cdot (x_i - p)\)
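
A quick numerical check of the lemma (my own sketch, using the clipped empirical mean as a stand-in for \(\mathcal{M}\)):

import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 20_000
corr, err = 0.0, 0.0

for _ in range(trials):
    p = rng.uniform(-1, 1)                                   # p ~ Unif([-1, 1])
    x = np.where(rng.random(n) < (1 + p) / 2, 1.0, -1.0)     # x_i in {+1, -1} with E[x_i] = p
    m = np.clip(x.mean(), -1, 1)                             # stand-in mechanism M(X)
    corr += np.sum((m - p) * (x - p))                        # sum_i A(x_i, M(X))
    err += (m - p) ** 2

print(corr / trials, 1/3 - err / trials)   # lemma: left >= right (here with plenty of slack)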

Fingerprinting Lemma - Picture

\mathcal{A}(z, \mathcal{M}(X)) = (\mathcal{M}(X) - p) \cdot (z - p)

If \(\mathcal{M}\) is accurate: \(\displaystyle \mathbb{E}[\mathcal{A}(x_i, \mathcal{M}(X))]\) is large

If \(z\) is indep. of \(X\): \(\displaystyle \mathbb{E}[|\mathcal{A}(z, \mathcal{M}(X))|]\) is small

Depends on distribution of \(X\) and \(p\)

From 1D Lemma to a Code(-Like) Object

Fingerprinting Lemma leads to a kind of fingerprinting code

Bonus: quite transparent and easy to describe

Key Idea: Make \(\tilde{O}(n^2)\) independent copies

\(\mathcal{M} \colon ([-1,1]^d)^n \to [-1,1]^d\)

\(p \sim \mathrm{Unif}([-1,1]^d)\)

\(x_1, \dotsc, x_n \in \{\pm 1\}^d\) random such that \(\mathbb{E}[x_i] = p\)

\displaystyle \mathbb{E}\Big [\sum_{i = 1}^n \langle\mathcal{M}(X) - p, x_i - p\rangle \Big ] \geq d/10

For \(d = \Omega(n^2 \log n)\), with \(\mathcal{A}(x_i, \mathcal{M}(X)) = \langle \mathcal{M}(X) - p, x_i - p\rangle\),
\displaystyle \mathbb{P}\Big [\sum_{i = 1}^n \mathcal{A}(x_i, \mathcal{M}(X)) \leq d/20 \Big ] \leq \frac{1}{n^3}

From Lemma to Lower Bounds

If \(\mathcal{M}\) is accurate, correlation is high: if \(\mathbb{E}\big[\lVert \mathcal{M}(X) - p\rVert_2^2\big] \leq d/6\), then
\displaystyle \mathbb{E}\Big [\sum_{i = 1}^n \langle \mathcal{M}(X) - p, x_i - p \rangle \Big ] \geq \frac{d}{6}

If \(\mathcal{M}\) is \((\varepsilon, \delta)\)-DP, correlation is low: since \(\mathbb{E}[\mathcal{A}(x_i, \mathcal{M}(X))] \approx \mathbb{E}[\mathcal{A}(x_i, \mathcal{M}(X_{-i}))]\),
\displaystyle \mathbb{E}\Big [\sum_{i = 1}^n \mathcal{A}(x_i, \mathcal{M}(X)) \Big ] \lesssim n \varepsilon \cdot \sqrt{\mathbb{E}\big[{\lVert\mathcal{M}(X) - p\rVert_2^2}\big]} + n d \delta

Combining the two:
\displaystyle \frac{d}{n} \lesssim \sqrt{\mathbb{E}\big[{\lVert\mathcal{M}(X) - p\rVert_2^2}\big]}
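
Spelling out the last step (treating \(\varepsilon\) as a constant \(\leq 1\) and taking \(\delta\) small enough that the \(nd\delta\) term is negligible, e.g. \(\delta \ll 1/n\)):

\displaystyle \frac{d}{6} \;\lesssim\; n \varepsilon \sqrt{\mathbb{E}\big[\lVert\mathcal{M}(X) - p\rVert_2^2\big]} + n d \delta \;\implies\; \sqrt{\mathbb{E}\big[\lVert\mathcal{M}(X) - p\rVert_2^2\big]} \;\gtrsim\; \frac{d}{n\varepsilon} \;=\; \Omega\Big(\frac{d}{n}\Big)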

Extension to Gaussian Case

Lemma (Gaussian Fingerprinting Lemma)

\(\mathcal{M}\colon \mathbb{R}^n \to \mathbb{R}\)

\(\mu \sim \mathcal{N}(0, 1/2)\)

\(x_1, \dotsc, x_n \sim \mathcal{N}(\mu,1)\)

\displaystyle \mathbb{E}\Big [\sum_{i = 1}^n (\mathcal{M}(X) - \mu) \cdot (x_i - \mu) \Big ] \geq \frac{1}{2} - \mathbb{E}\big[(\mathcal{M}(X) - \mu)^2\big]

One advantage of lemmas over codes:

Easier to extend to different settings

Implies similar lower bounds for privately estimating the mean of a Gaussian

Lower Bounds for Gaussian Covariance Matrix Estimation

Work done in collaboration with Nick Harvey

Roadblocks to Fingerprinting Lemmas

\displaystyle x_1, x_2, \dotsc, x_n \sim \mathcal{N}(0, \Sigma)
\displaystyle \Sigma \succ 0

Unknown Covariance Matrix

on \(\mathbb{R}^d\)

To get a Fingerprinting Lemma, we need a random \(\Sigma\)

Most FPLs are for \(d = 1\), and then use independent copies

We can use diagonally dominant matrices [Kamath, Mouzakis, Singhal '22]:
Diagonal \(= \frac{3}{4} \pm \frac{1}{4d}\),   Off-diagonal \(= \pm \frac{1}{2d}\)

but this leads to limited lower bounds for covariance estimation:
outputting \(0\) has error \(O(1)\), i.e., \(\mathbb{E}[\lVert \mathcal{M}(X) - 0 \rVert_F^2 ] = O(1)\)

Can't lower bound accuracy of algorithms with \(\omega(1)\) error

Our Results

Theorem

For any \((\varepsilon, \delta)\)-DP algorithm \(\mathcal{M}\) such that

\displaystyle \mathbb{E}\big[\lVert\mathcal{M}(X) - \Sigma\rVert_F^2\big] \leq \alpha^2 = O(d)

and

\displaystyle \delta = O\Big( \frac{1}{n \ln n}\Big)

we have

\displaystyle n = \Omega\Big(\frac{d^2}{\alpha\varepsilon}\Big)

Previous \(n = \Omega\big(\tfrac{d^2}{\alpha\varepsilon}\big)\) lower bounds [Kamath et al. '22, Narayanan '23] required either
\(\alpha = O(1)\)   OR   \(\delta = \tilde O\big(\tfrac{1}{d^2}\big) = o\big(\tfrac{1}{n}\big)\)

Our result covers both regimes

\(\delta = O\big(\tfrac{1}{n \ln n}\big)\) is nearly the highest reasonable value

Main Contribution:  Fingerprinting Lemma without independence

Which Distribution to Use?

Our results use a very natural distribution: the Wishart Distribution

\displaystyle \Sigma = \frac{1}{2d} \; G \; G^{T} \succeq 0
where \(G\) is a \(d \times 2d\) random Gaussian matrix

Natural distribution over PSD matrices

Entries are highly correlated
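
A two-line sampling sketch of this distribution (the dimension d below is just a placeholder):

import numpy as np

d = 5
rng = np.random.default_rng(0)
G = rng.standard_normal((d, 2 * d))    # d x 2d matrix with i.i.d. N(0, 1) entries
Sigma = (G @ G.T) / (2 * d)            # Wishart sample: PSD and E[Sigma] = I
print(np.linalg.eigvalsh(Sigma))       # all eigenvalues are >= 0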

A Different Correlation Statistic

A Peek Into the Proof for 1D

Lemma (Gaussian Fingerprinting Lemma)

\(\mu \sim \mathcal{N}(0, 1/2)\)

\(x_1, \dotsc, x_n \sim \mathcal{N}(\mu,1)\)

\displaystyle \mathbb{E}\Big [\sum_{i = 1}^n (\mathcal{M}(X) - \mu) \cdot (x_i - \mu) \Big ] \geq \frac{1}{2} - \mathbb{E}\big[(\mathcal{M}(X) - \mu)^2\big]

Claim 1

\displaystyle \mathbb{E}_X\Big [\sum_{i = 1}^n (\mathcal{M}(X) - \mu) \cdot (x_i - \mu) \Big ] = g'(\mu), \quad \text{where } g(\mu) = \mathbb{E}_X[\mathcal{M}(X)]

Claim 2

\displaystyle \mathbb{E}[ g'(\mu)] = 2 \mathbb{E}[ g(\mu) \mu]

Stein's Lemma

Follows from integration by parts

\displaystyle \mathbb{E}[ g'(\mu)] = \int g'(\mu) \cdot p(\mu) \mathrm{d} \mu
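
Filling in that step: \(\mu \sim \mathcal{N}(0, 1/2)\) has density \(p(\mu) \propto e^{-\mu^2}\), so \(p'(\mu) = -2\mu\, p(\mu)\) and (assuming the boundary terms vanish)

\displaystyle \mathbb{E}[ g'(\mu)] = \int g'(\mu) \cdot p(\mu) \,\mathrm{d} \mu = - \int g(\mu) \cdot p'(\mu) \,\mathrm{d}\mu = \int g(\mu) \cdot 2\mu \, p(\mu) \,\mathrm{d}\mu = 2\, \mathbb{E}[ g(\mu) \mu]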

A Peek Into the Proof of New FP Lemma

Fingerprinting Lemma

Need to Lower Bound
\displaystyle \mathbb{E}\Big[ \sum_{i,j }\partial_{ij} \; g(\Sigma)_{ij}\Big] = \mathbb{E}[\mathrm{div}\, g(\Sigma)], \quad \text{where } g(\Sigma) = \mathbb{E}[\mathcal{M}(X)]

\(\Sigma \sim\) Wishart leads to elegant analysis (Stein-Haff Identity)

"Move the derivative" from \(g\) to \(p\) with integration by parts (Stokes' Theorem):
\displaystyle \mathbb{E}[ \mathrm{div}\, g(\Sigma)] = \int \mathrm{div}\, g(\Sigma) \cdot p(\Sigma) \,\mathrm{d}\Sigma

CombΘ Seminar

An Introduction to Differential Privacy and Fingerprinting Techniques for Lower Bounds

Victor Sanches Portella

November, 2024

cs.ubc.ca/~victorsp