Knowledge-Injected Federated Learning
Zhenan Fan
Huawei Technologies Canada
Midwest Optimization Meeting 2022
Collaborators:
Zirui Zhou, Jian Pei, Michael P. Friedlander,
Jiajie Hu, Chengliang Li, Yong Zhang
Outline
1. Motivating Case Study
2. Knowledge-Injected Federated Learning
3. Numerical Results
Coal to Make Coke and Steel


https://www.uky.edu/KGS/coal/coal-for-cokesteel.php
Coal-Mixing in Coking Process


Challenging, as there is no direct formula
Based on experience and knowledge
Largely affects cost

Task Description
Goal: improve the expert's prediction model with machine learning
Data scarcity: collecting data is expensive and time consuming
We unite 4 coking industries to collaboratively work on this task
Challenges
local datasets have different distributions
industries have different expert (knowledge) models
privacy of local datasets and knowledge models has to be preserved
Multiclass Classification

Setting
$$\mathcal{D} = \left\{(x^{(i)}, y^{(i)})\right\}_{i=1}^N \subset \mathcal{X} \times \{1,\dots,k\} \sim \mathcal{F}$$
$\mathcal{D}$: training set
$x^{(i)}$: data instance (features of raw coal)
$\mathcal{X}$: feature space
$y^{(i)}$: label (quality of the final coke)
$\{1,\dots,k\}$: label space
$\mathcal{F}$: data distribution
Task
Find $f: \mathcal{X} \to \{1,\dots,k\}$ such that $\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{F}}[f(x) \neq y]$ is small, or $\mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[f(x) \neq y]$ is small.
Knowledge-based Models

Prediction-type Knowledge Model (P-KM)
$g_p: \mathcal{X} \to \{1,\dots,k\}$ such that $g_p(x)$ is a point estimate of $y$ for all $(x,y) \sim \mathcal{F}$
E.g., mechanistic prediction models, such as a differential equation that describes the underlying physical process.
Example: $k = 3,\enspace g_p(x) = 2$
Range-type Knowledge Model (R-KM)
$g_r: \mathcal{X} \to 2^{\{1,\dots,k\}}$ such that $y \in g_r(x)$ for all $(x,y) \sim \mathcal{F}$
E.g., can be derived from the causality of the input-output relationship.
Example: $k = 3,\enspace g_r(x) = \{2, 3\}$
Federated Learning with Knowledge-based Models

$M$ clients and a central server.
Each client $m$ has
a training set $\mathcal{D}^m \sim \mathcal{F}^m$, where $\mathcal{F}^m$ is a conditional data distribution depending on $\mathcal{F}$
a P-KM $g_p^m$ for distribution $\mathcal{F}^m$
an R-KM $g_r^m$ for distribution $\mathcal{F}^m$
$g_p^m$ agrees with $g_r^m$, i.e., $g_p^m(x) \in g_r^m(x)$ for all $x$
Task Description

Design a federated learning framework such that
each client $m$ obtains a personalized predictive model $f^m: \mathcal{X} \to \Delta^k \coloneqq \{p \in \mathbb{R}^k \mid p \geq 0,\ \sum_i p_i = 1\}$
$f^m$ utilizes the local P-KM $g^m_p$ with a controllable trust level
$f^m$ agrees with the local R-KM $g^m_r$, i.e., $\{i \mid f^m(x)_i > 0\} \subseteq g^m_r(x)$ for all $x \in \mathcal{X}$
clients can benefit from others' datasets and knowledge
privacy of local datasets and local KMs is protected
Direct Formulation Invokes Infinitely Many Constraints

Simple setting
Single client with $\mathcal{X} = \mathbb{R}^d$
Logistic model $f(\theta; x) = \operatorname{softmax}(\theta^T x)$ with $\theta\in\mathbb{R}^{d\times k}$, so that $f(\theta; \cdot): \mathbb{R}^d \to \Delta^k$
where $\operatorname{softmax}(z \in \mathbb{R}^k)_i = \dfrac{\exp(z_i)}{\sum_j \exp(z_j)}$
Loss function $\mathcal{L}(\theta) = \frac{1}{|\mathcal{D}|} \sum\limits_{(x,y) \in \mathcal{D}} \operatorname{crossentropy}(f(\theta; x), y)$
where $\operatorname{crossentropy}(p\in\Delta^k, y\in\{1,\dots,k\}) = -\log(p_y)$
Challenging optimization problem
$$\min\limits_{\theta \in \mathbb{R}^{d\times k}} \enspace \mathcal{L}(\theta) \enspace\text{s.t.}\enspace \{i \mid f(\theta; x)_i > 0\} \subseteq g_r(x) \enspace \forall x \in \mathbb{R}^d$$
(infinitely many constraints)
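The softmax, cross-entropy, and averaged loss defined above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names, the list-of-rows layout for $\theta$, and the toy data are assumptions made for the example and are not taken from the talk.

```python
import math

def softmax(z):
    """Numerically stable softmax of a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, y):
    """-log(p_y), with labels y in {1, ..., k} as on the slide."""
    return -math.log(p[y - 1])

def logistic_model(theta, x):
    """f(theta; x) = softmax(theta^T x); theta is stored as d rows of k floats."""
    d, k = len(theta), len(theta[0])
    logits = [sum(theta[j][i] * x[j] for j in range(d)) for i in range(k)]
    return softmax(logits)

def loss(theta, data):
    """Average cross-entropy over a dataset of (x, y) pairs."""
    total = sum(cross_entropy(logistic_model(theta, x), y) for x, y in data)
    return total / len(data)

# Toy data (made up for illustration): d = 2 features, k = 3 classes.
theta = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
data = [([1.0, 2.0], 1), ([0.5, -1.0], 3)]
print(loss(theta, data))  # theta = 0 gives uniform probabilities, so the loss is log(3)
```

The sketch covers only the unconstrained loss; it does not attempt to enforce the range constraint for every $x \in \mathbb{R}^d$.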
Architecture Design

The server provides a general deep learning model
$f(\theta; \cdot): \mathcal{X} \to \mathbb{R}^k$, where $\theta$ denotes the learnable model parameters
Function transformation
$$\mathcal{T}_{\lambda, g_p, g_r}(f)(x) = (1-\lambda)\operatorname{softmax}(f(x) + z_r) + \lambda z_p$$
where
$$(z_r)_i = \begin{cases} 0 &\text{if}\enspace i \in g_r(x)\\ -\infty &\text{otherwise} \end{cases} \enspace\text{and}\enspace (z_p)_i = \begin{cases} 1 &\text{if}\enspace i = g_p(x)\\ 0 &\text{otherwise} \end{cases}$$
Personalized model
$$f^m(\theta; \cdot) \coloneqq \mathcal{T}_{\lambda^m, g_p^m, g_r^m}(f(\theta; \cdot))$$
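For a single input $x$, the function transformation above amounts to masking the logits, applying softmax, and mixing with a one-hot vector. The following is a minimal sketch, assuming the raw model output $f(x)$ is given as a list of logits, labels are 1-based, and $g_r(x)$ is a non-empty set containing $g_p(x)$; the name `transform` and the example numbers are hypothetical.

```python
import math

def softmax(z):
    """Stable softmax; entries equal to -inf receive probability exactly 0."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def transform(logits, lam, gp_label, gr_labels):
    """(1 - lam) * softmax(f(x) + z_r) + lam * z_p for one input x.

    logits    : raw model output f(x), a list of k floats
    lam       : trust level in [0, 1]
    gp_label  : P-KM prediction g_p(x), a label in {1, ..., k}
    gr_labels : R-KM range g_r(x), a set of labels (assumed to contain gp_label)
    """
    if gp_label not in gr_labels:
        raise ValueError("g_p(x) must lie in g_r(x)")
    k = len(logits)
    # Adding z_r: keep logits inside the range, send the others to -inf.
    masked = [logits[i] if (i + 1) in gr_labels else -math.inf for i in range(k)]
    probs = softmax(masked)
    # z_p is the one-hot vector of the P-KM prediction.
    return [(1 - lam) * probs[i] + lam * (1.0 if i + 1 == gp_label else 0.0)
            for i in range(k)]

# Example matching the slides' toy values: k = 3, g_p(x) = 2, g_r(x) = {2, 3}.
out = transform([2.0, 0.0, 1.0], lam=0.6, gp_label=2, gr_labels={2, 3})
print(out)  # class 1 gets probability 0; class 2 gets at least 0.6
```

Class 1 has the largest raw logit in this example, yet it ends with probability zero because it lies outside the range, and class 2 wins because the trust level exceeds 0.5.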
Properties of Personalized Model

$f^m(\theta; \cdot)$ is a valid predictive model, i.e., $f^m(\theta; x) \in \Delta^k$ for all $x \in \mathcal{X}$
$\lambda^m \in [0,1]$ controls the trust level of the local P-KM $g^m_p$
$\langle f^m(\theta; x), g^m_p(x) \rangle \geq \lambda^m$ for all $x \in \mathcal{X}$ (with $g^m_p(x)$ identified with its one-hot encoding in $\mathbb{R}^k$)
If $\lambda^m > 0.5$ then $f^m$ coincides with $g^m_p$: $\operatorname{argmax}_i f^m(\theta; x)_i = g_p^m(x)$
$f^m(\theta; \cdot)$ agrees with the local R-KM $g^m_r$
$\{i \mid f^m(\theta; x)_i > 0\} \subseteq g^m_r(x)$ for all $x \in \mathcal{X}$
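One way to see why these properties hold is sketched below. It assumes that $g_r^m(x)$ is non-empty and that $g_p^m(x) \in g_r^m(x)$, and it uses the convention $\exp(-\infty) = 0$; the shorthand $p$ and $j$ is introduced here only for the argument.

```latex
\begin{align*}
&\text{Let } p = \operatorname{softmax}(f(\theta; x) + z_r) \in \Delta^k
  \text{ and } j = g_p^m(x), \text{ so } f^m(\theta; x) = (1-\lambda^m)\, p + \lambda^m z_p. \\
&\text{Validity: } p \text{ and } z_p \text{ both lie in } \Delta^k,
  \text{ and } \Delta^k \text{ is convex, so } f^m(\theta; x) \in \Delta^k. \\
&\text{Trust level: } f^m(\theta; x)_j = (1-\lambda^m)\, p_j + \lambda^m \geq \lambda^m. \\
&\text{Coincidence: if } \lambda^m > 0.5 \text{ and } i \neq j, \text{ then }
  f^m(\theta; x)_i = (1-\lambda^m)\, p_i \leq 1 - \lambda^m < \lambda^m \leq f^m(\theta; x)_j. \\
&\text{Range: if } i \notin g_r^m(x), \text{ then } p_i = 0 \text{ and } (z_p)_i = 0
  \text{ (because } j \in g_r^m(x)\text{), so } f^m(\theta; x)_i = 0.
\end{align*}
```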
Optimization

Optimization problem
$$\min\limits_{\theta}\enspace \mathcal{L}(\theta) \coloneqq \sum\limits_{m=1}^M \mathcal{L}^m(\theta) \enspace\text{with}\enspace \mathcal{L}^m(\theta) = \frac{1}{|\mathcal{D}^m|}\sum\limits_{(x,y) \in \mathcal{D}^m} \operatorname{crossentropy}(f^m(\theta; x), y)$$
where $\mathcal{L}$ is the global loss and $\mathcal{L}^m$ is the local loss of client $m$
FedAvg [McMahan et al.'17]
The server selects a subset of clients $S \subseteq \{1,\dots,M\}$ and sends them the latest model $\theta$
Each selected client $m$ sets $\theta^m \leftarrow \theta$ and locally updates its model $\theta^m \leftarrow \theta^m - \eta\nabla\mathcal{L}^m(\theta^m)$ ($d$ times)
The server updates the global model by aggregating the local models $\theta \leftarrow \frac{1}{|S|}\sum\limits_{m\in S} \theta^m$
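The FedAvg loop above can be sketched as follows. This is a toy illustration, not the implementation in the accompanying package: each client exposes a hypothetical gradient function for its local loss, the model is a plain list of floats, and the quadratic local losses are made up so that the result is easy to check.

```python
import random

def fedavg(theta, local_grads, rounds, clients_per_round, local_steps, lr, seed=0):
    """Toy FedAvg: sample clients, run local gradient steps, average the models."""
    rng = random.Random(seed)
    num_clients = len(local_grads)
    for _ in range(rounds):
        # Server: select a subset S of clients and send them the current model.
        selected = rng.sample(range(num_clients), clients_per_round)
        local_models = []
        for m in selected:
            theta_m = list(theta)  # client m starts from the global model
            for _ in range(local_steps):
                grad = local_grads[m](theta_m)
                theta_m = [t - lr * g for t, g in zip(theta_m, grad)]
            local_models.append(theta_m)
        # Server: uniform average of the returned local models.
        theta = [sum(coords) / len(local_models) for coords in zip(*local_models)]
    return theta

# Made-up local losses L^m(theta) = 0.5 * (theta - c_m)^2 with gradient theta - c_m.
centers = [1.0, 2.0, 3.0, 6.0]
grads = [lambda th, c=c: [th[0] - c] for c in centers]
theta = fedavg([0.0], grads, rounds=50, clients_per_round=4, local_steps=5, lr=0.1)
print(theta)  # approaches the minimizer of the summed losses, the mean of the centers
```

With all four clients selected in every round, the iterate contracts toward the mean of the centers (3.0); with partial participation the iterate depends on which clients are sampled.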
Numerical Results (Case-study)


Test accuracy
$$\text{TA} = \frac{1}{|\mathcal{D}^m_{\text{test}}|} \sum\limits_{(x,y) \in \mathcal{D}^m_{\text{test}}} \mathbb{I}(f^m(\theta; x) = y)$$
Percentage of violation
$$\text{POV} = \frac{1}{|\mathcal{D}^m_{\text{test}}|} \sum\limits_{(x,y) \in \mathcal{D}^m_{\text{test}}} \mathbb{I}(f^m(\theta; x) \notin g_r^m(x))$$
Here $f^m(\theta; x)$ is read as the predicted label, i.e., $\operatorname{argmax}_i f^m(\theta; x)_i$.
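The two metrics above can be computed as in the sketch below. It assumes the model returns a probability vector and that the predicted label is its 1-based argmax; the function names and the tiny lookup-table model are hypothetical and serve only to make the example self-contained.

```python
def predicted_label(probs):
    """1-based argmax of a probability vector."""
    return max(range(len(probs)), key=lambda i: probs[i]) + 1

def accuracy(model, test_set):
    """TA: fraction of test pairs (x, y) whose predicted label equals y."""
    hits = sum(predicted_label(model(x)) == y for x, y in test_set)
    return hits / len(test_set)

def violation_rate(model, g_r, test_set):
    """POV: fraction of test inputs whose predicted label falls outside g_r(x)."""
    bad = sum(predicted_label(model(x)) not in g_r(x) for x, _ in test_set)
    return bad / len(test_set)

# Made-up model and range-type knowledge model, both stored as lookup tables.
outputs = {0: [0.7, 0.2, 0.1], 1: [0.1, 0.3, 0.6], 2: [0.2, 0.2, 0.6], 3: [0.1, 0.8, 0.1]}
ranges = {0: {1, 2}, 1: {1, 2}, 2: {3}, 3: {1, 2}}
test_set = [(0, 1), (1, 2), (2, 3), (3, 1)]

ta = accuracy(outputs.__getitem__, test_set)
pov = violation_rate(outputs.__getitem__, ranges.__getitem__, test_set)
print(ta, pov)  # 0.5 0.25
```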
Numerical Results (Public Datasets)

Datasets
Covtype: number of classes $k = 7$, feature space $\mathcal{X} = \mathbb{R}^{54}$
FMNIST: number of classes $k = 10$, feature space $\mathcal{X} = \mathbb{R}^{28\times 28}$
Data distribution
Each client only gets samples from some classes.
P-KM
We train a deep model with a subset of features.
R-KM
We construct a hashmap to guarantee the true label is within the range.
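The slides do not spell out how the hashmap is built. Purely as an illustration, one hypothetical construction maps each sample index to a label set made of the true label plus a few randomly chosen other labels, which guarantees that the true label is always inside the range. The function name, the `extra` parameter, and the toy dataset are assumptions for this sketch.

```python
import random

def build_range_km(dataset, k, extra=1, seed=0):
    """Hypothetical R-KM: sample index -> label set that contains the true label."""
    rng = random.Random(seed)
    table = {}
    for idx, (_, y) in enumerate(dataset):
        others = [c for c in range(1, k + 1) if c != y]
        # The true label is always included; `extra` distractor labels are added.
        table[idx] = {y, *rng.sample(others, extra)}
    return table

# Made-up dataset of (features, label) pairs with k = 7 classes.
dataset = [([0.1, 0.2], 3), ([0.4, 0.9], 7), ([0.5, 0.5], 1)]
range_km = build_range_km(dataset, k=7, extra=2)
print(range_km)
```

Larger values of `extra` make the range less informative, so in such a construction this parameter would control how much knowledge the R-KM carries.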
Numerical Results (Public Datasets)


Open-source package: https://github.com/ZhenanFanUBC/FedMech.jl
Paper: Fan, Zhenan, Zirui Zhou, Jian Pei, Michael P. Friedlander, Jiajie Hu, Chengliang Li, and Yong Zhang. "Knowledge-Injected Federated Learning." arXiv preprint arXiv:2208.07530 (2022).
Thank you! Questions?
Slides for the talk at the 24th Midwest Optimization Meeting https://www.math.uwaterloo.ca/~hwolkowi/Univ.Waterloo.24thMidwestOptimizationMeeting.html