Knowledge-Injected

Federated Learning

Zhenan Fan
Huawei Technologies Canada

Midwest Optimization Meeting 2022


Collaborators:
Zirui Zhou, Jian Pei, Michael P. Friedlander,
Jiajie Hu, Chengliang Li, Yong Zhang

Outline

1. Motivating Case Study
2. Knowledge-Injected Federated Learning
3. Numerical Results

Coal to Make Coke and Steel

https://www.uky.edu/KGS/coal/coal-for-cokesteel.php

Coal-Mixing in Coking Process

Challenging: there is no direct formula
Based on expert experience and knowledge
Largely affects the cost

Task Description

Goal: improve the expert's prediction model with machine learning
Data scarcity: collecting data is expensive and time-consuming
We unite four companies from the coking industry to work on this task collaboratively
Challenges
local datasets have different distributions
companies have different expert (knowledge) models
privacy of local datasets and knowledge models has to be preserved

Multiclass Classification

\red{\mathcal{D}} = \left\{(\blue{x^{(i)}}, \green{y^{(i)}})\right\}_{i=1}^N \subset \blue{\mathcal{X}} \times \green{\{1,\dots,k\}} \sim \purple{\mathcal{F}}

\red{\mathcal{D}}: training set
\blue{x}: data instance (features of the raw coal) in the feature space \blue{\mathcal{X}}
\green{y}: label (quality of the final coke) in the label space \green{\{1,\dots,k\}}
\purple{\mathcal{F}}: data distribution

Task
\text{Find}\enspace f: \mathcal{X} \to \{1,\dots,k\} \enspace\text{such that}\enspace \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{F}}[\mathbb{I}(f(x) \neq y)] \enspace\text{is small.}
Setting
\text{Only the sample}\enspace \mathcal{D} \enspace\text{is available, so in practice we make}\enspace \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[\mathbb{I}(f(x) \neq y)] \enspace\text{small.}

Knowledge-based Models

Prediction-type Knowledge Model (P-KM)
g_p: \mathcal{X} \to \{1,\dots,k\} \enspace\text{such that}\enspace g_p(x) \enspace\text{is a point estimate of}\enspace y \enspace\forall (x,y) \sim \mathcal{F} \quad \red{(k = 3,\enspace g_p(x) = 2)}
E.g., mechanistic prediction models, such as a differential equation that describes the underlying physical process.

Range-type Knowledge Model (R-KM)
g_r: \mathcal{X} \to 2^{\{1,\dots,k\}} \enspace\text{such that}\enspace y \in g_r(x) \enspace\forall (x,y) \sim \mathcal{F} \quad \red{(k = 3,\enspace g_r(x) = \{2, 3\})}
E.g., can be derived from the causality of the input-output relationship.
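As a concrete illustration, here is a minimal Python sketch of the two knowledge-model interfaces; the feature threshold and both rules are hypothetical, chosen only to match the k = 3 example above:

```python
import numpy as np

K = 3  # number of classes, matching the k = 3 example above

def g_p(x: np.ndarray) -> int:
    """P-KM: a point estimate of the label for input x."""
    # Hypothetical mechanistic rule driven by one feature.
    return 2 if x[0] > 0.5 else 1

def g_r(x: np.ndarray) -> set:
    """R-KM: a set of labels guaranteed to contain the true label y."""
    # Hypothetical causal rule that excludes label 1 when x[0] is large.
    return {2, 3} if x[0] > 0.5 else {1, 2, 3}

x = np.array([0.8, 0.1])
assert g_p(x) in g_r(x)  # the P-KM agrees with the R-KM on this input
```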

Federated Learning with Knowledge-based Models

M clients and a central server. Each client m has
\text{a training set}\enspace \mathcal{D}^m \sim \purple{\mathcal{F}^m}, \enspace\text{where}\enspace \mathcal{F}^m \enspace\text{is a conditional data distribution depending on the global distribution}\enspace \purple{\mathcal{F}}
\text{a P-KM}\enspace g_p^m \enspace\text{for distribution}\enspace \mathcal{F}^m
\text{an R-KM}\enspace g_r^m \enspace\text{for distribution}\enspace \mathcal{F}^m
g_p^m \enspace\text{agrees with}\enspace g_r^m \enspace \red{(g_p^m(x) \in g_r^m(x) \enspace \forall x)}

Task Description

Design a federated learning framework such that
each client m obtains a personalized predictive model
f^m: \mathcal{X} \to \Delta^k \coloneqq \{p \in \mathbb{R}^k_+ \mid \textstyle\sum_i p_i = 1\}
f^m \enspace\text{utilizes the local P-KM}\enspace g^m_p \enspace\text{with a controllable trust level}
f^m \enspace\text{agrees with the local R-KM}\enspace g^m_r, \enspace\text{i.e.,}\enspace \{i \mid f^m(x)_i > 0\} \subseteq g^m_r(x)\enspace \forall x \in \mathcal{X}
clients can benefit from others' datasets and knowledge
privacy of local datasets and local KMs is preserved

Direct Formulation Invokes Infinitely Many Constraints

Simple setting 
\text{Single client with}\enspace \mathcal{X} = \mathbb{R}^d
\text{Logistic model}\enspace f(\theta; x) = \blue{\mathop{softmax}}(\theta^T x) \enspace\text{with}\enspace \theta\in\mathbb{R}^{d\times k}
\blue{ \mathop{softmax}(z \in \mathbb{R}^k)_i = \dfrac{\exp(z_i)}{\sum_j \exp(z_j)} }
\red{(f(\theta; \cdot): \mathbb{R}^d \to \Delta^k)}
\text{Loss function}\enspace \mathcal{L}(\theta) = \frac{1}{|\mathcal{D}|} \sum\limits_{(x,y) \in \mathcal{D}} \blue{\mathop{crossentropy}}(f(\theta; x), y)
\blue{ \mathop{crossentropy}(p\in\Delta^k, y\in\{1,\dots,k\}) = -\log(p_y) }
Challenging optimization problem
\min\limits_{\theta \in \mathbb{R}^{d\times k}} \enspace \mathcal{L}(\theta) \enspace\text{s.t.}\enspace \{i \mid f(\theta; x)_i > 0\} \subseteq g_r(x) \enspace \forall x \in \mathbb{R}^d
\red{(\text{infinitely many constraints})}
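For concreteness, a minimal numpy sketch of this logistic model and loss; labels are 0-based here, and the data layout is assumed. The sketch covers only the unconstrained model and loss, since the support constraint is precisely the part that cannot be imposed pointwise over all of R^d:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def f(theta: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Logistic model: theta is d x k, x is a d-vector; output lies in the simplex."""
    return softmax(theta.T @ x)

def loss(theta, D):
    """Average cross-entropy over pairs (x, y) with y in {0, ..., k-1}."""
    return float(np.mean([-np.log(f(theta, x)[y]) for x, y in D]))
```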

Architecture Design

The server provides a general deep learning model
f(\red{\theta}; \cdot): \mathcal{X} \to \mathbb{R}^k \enspace\text{with learnable model parameters}\enspace \red{\theta}
Function transformation
\mathcal{T}_{\lambda, g_p, g_r}(f)(x) = (1-\lambda)\mathop{softmax}(f(x) + z_r) + \lambda z_p
where
(z_r)_i = \begin{cases} 0 &\text{if}\enspace i \in g_r(x)\\ -\infty &\text{otherwise} \end{cases} \enspace\text{and}\enspace (z_p)_i = \begin{cases} 1 &\text{if}\enspace i = g_p(x)\\ 0 &\text{otherwise} \end{cases}
Personalized model
f^m(\red{\theta}; \cdot) \coloneqq \mathcal{T}_{\lambda^m, g_p^m, g_r^m}(f(\red{\theta}; \cdot))
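A minimal numpy sketch of this transformation at a single input x, assuming 0-based labels and the toy knowledge-model interfaces sketched earlier (`gp_label = g_p(x)`, `gr_set = g_r(x)`):

```python
import numpy as np

def transform(f_x: np.ndarray, lam: float, gp_label: int, gr_set: set, k: int) -> np.ndarray:
    """Apply T_{lambda, g_p, g_r} to the raw scores f_x = f(theta; x) in R^k."""
    allowed = list(gr_set)
    z_r = np.full(k, -np.inf)
    z_r[allowed] = 0.0                      # -inf outside g_r(x), 0 inside
    s = f_x + z_r
    s = s - s[allowed].max()                # shift for numerical stability
    e = np.exp(s)                           # exp(-inf) = 0 removes forbidden labels
    p = e / e.sum()                         # softmax supported on g_r(x)
    z_p = np.zeros(k)
    z_p[gp_label] = 1.0                     # one-hot P-KM prediction
    return (1 - lam) * p + lam * z_p        # convex combination stays in the simplex
```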

Properties of Personalized Model

f^m(\theta; \cdot) \enspace\text{is a valid predictive model, i.e.,}\enspace f^m(\theta; x) \in \Delta^k \enspace \forall x \in \mathcal{X}
\lambda^m \in [0,1] \enspace\text{controls the trust level of the local P-KM}\enspace g^m_p
f^m(\theta; x)_{g^m_p(x)} \geq \lambda^m \enspace \forall x \in \mathcal{X}
\text{If}\enspace \lambda^m > 0.5, \enspace\text{then}\enspace f^m \enspace\text{coincides with}\enspace g^m_p
\argmax_i f^m(\theta; x)_i = g_p^m(x)
f^m(\theta; \cdot) \enspace\text{agrees with the local R-KM}\enspace g^m_r
\{i \mid f^m(\theta; x)_i > 0\} \subseteq g^m_r(x)\enspace \forall x \in \mathcal{X}
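These properties can be sanity-checked numerically with the `transform` sketch from the previous slide (same illustrative names and 0-based labels):

```python
import numpy as np

k, lam = 3, 0.6
f_x = np.array([0.2, 1.5, -0.3])                      # raw scores f(theta; x)
p = transform(f_x, lam, gp_label=1, gr_set={1, 2}, k=k)

assert abs(p.sum() - 1.0) < 1e-12 and (p >= 0).all()  # valid point in the simplex
assert p[1] >= lam                                    # trust-level lower bound
assert np.argmax(p) == 1                              # lam > 0.5 => follows the P-KM
assert p[0] == 0.0                                    # support inside g_r(x) = {1, 2}
```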

Optimization

Optimization problem 
\min\limits_{\theta}\enspace\red{\mathcal{L}}(\theta) \coloneqq \sum\limits_{m=1}^M \red{\mathcal{L}^m}(\theta) \enspace\text{(global loss)}\enspace\text{with}\enspace \mathcal{L}^m(\theta) = \frac{1}{|\mathcal{D}^m|}\sum\limits_{(x,y) \in \mathcal{D}^m} \mathop{crossentropy}(f^m(\theta; x), y) \enspace\text{(local loss)}
FedAvg [McMahan et al.'17]
\text{the server selects a subset of clients}\enspace S \subseteq \{1,\dots,M\} \enspace\text{and sends them the latest model}\enspace \theta
\text{each selected client}\enspace m \enspace\text{initializes}\enspace \theta^m \leftarrow \theta \enspace\text{and locally updates}\enspace \theta^m \leftarrow \theta^m - \eta\nabla\mathcal{L}^m(\theta^m) \enspace (d\enspace\text{times})
\text{the server updates the global model by averaging the local models:}\enspace \theta \leftarrow \frac{1}{|S|}\sum\limits_{m\in S} \theta^m
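A minimal sketch of this FedAvg loop; `clients[m]` is a hypothetical callable returning the local gradient ∇L^m at a given point, and all hyperparameter names are illustrative:

```python
import numpy as np

def fedavg(theta, clients, rounds, d, eta, sample_size, seed=0):
    """clients[m](theta) returns the local gradient, an ndarray shaped like theta."""
    rng = np.random.default_rng(seed)
    M = len(clients)
    for _ in range(rounds):
        S = rng.choice(M, size=sample_size, replace=False)  # server selects clients
        local_models = []
        for m in S:
            theta_m = theta.copy()                # client starts from the global model
            for _ in range(d):                    # d local gradient steps
                theta_m = theta_m - eta * clients[m](theta_m)
            local_models.append(theta_m)
        theta = np.mean(local_models, axis=0)     # server averages the local models
    return theta
```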

Numerical Results (Case-study)

Test accuracy
\text{TA} = \frac{1}{|\mathcal{D}^m_{\text{test}}|} \sum\limits_{(x,y) \in \mathcal{D}^m_{\text{test}}} \mathbb{I}\left(\argmax_i f^m(\theta; x)_i = y\right)
Percentage of violation
\text{POV} = \frac{1}{|\mathcal{D}^m_{\text{test}}|} \sum\limits_{(x,y) \in \mathcal{D}^m_{\text{test}}} \mathbb{I}\left(\argmax_i f^m(\theta; x)_i \notin g_r^m(x)\right)
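Both metrics in a short sketch, assuming `predict(theta, x)` returns the personalized distribution f^m(θ; x) and `g_r` is the local R-KM (names are illustrative; by construction of the personalized model, POV should come out zero):

```python
import numpy as np

def test_accuracy(predict, theta, D_test):
    """TA: fraction of test points whose argmax prediction matches the label."""
    return float(np.mean([np.argmax(predict(theta, x)) == y for x, y in D_test]))

def percentage_of_violation(predict, theta, D_test, g_r):
    """POV: fraction of test points predicted outside the R-KM's allowed range."""
    return float(np.mean([np.argmax(predict(theta, x)) not in g_r(x) for x, y in D_test]))
```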

Numerical Results (Public Datasets)

Datasets
\textbf{Covtype:}\enspace \text{number of classes}\enspace k = 7,\enspace \text{feature space}\enspace \mathcal{X} = \mathbb{R}^{54}
\textbf{FMNIST:}\enspace \text{number of classes}\enspace k = 10,\enspace \text{feature space}\enspace \mathcal{X} = \mathbb{R}^{28\times 28}
Data distribution
Each client only gets samples from some classes.
P-KM
We train a deep model with a subset of features.
R-KM
We construct a hash map that guarantees the true label is within the range (see the sketch below).
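One plausible way to build such a range map, sketched under the assumption that inputs are bucketed by a coarse hashable key and every label observed in a bucket is recorded; the paper's exact construction may differ:

```python
from collections import defaultdict

def build_range_km(D, bucket):
    """bucket(x) maps an input to a hashable key, e.g. a coarse quantization of x."""
    table = defaultdict(set)
    all_labels = set()
    for x, y in D:
        table[bucket(x)].add(y)       # record every label seen in this bucket
        all_labels.add(y)
    def g_r(x):
        # Unseen buckets fall back to the full label set, so y in g_r(x) always holds.
        return table.get(bucket(x), all_labels)
    return g_r
```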


Open-source Package  https://github.com/ZhenanFanUBC/FedMech.jl

Paper  Fan, Zhenan, Zirui Zhou, Jian Pei, Michael P. Friedlander, Jiajie Hu, Chengliang Li, and Yong Zhang. "Knowledge-Injected Federated Learning." arXiv preprint arXiv:2208.07530 (2022).

Thank you! Questions?

Slides for the talk at the 24th Midwest Optimization Meeting: https://www.math.uwaterloo.ca/~hwolkowi/Univ.Waterloo.24thMidwestOptimizationMeeting.html