Outline

1. Motivating Case Study

2. Knowledge-Injected Federated Learning

3. Numerical Results

Coal to Make Coke and Steel

https://www.uky.edu/KGS/coal/coal-for-cokesteel.php

Coal-Mixing in Coking Process

Challenging, as there is no direct formula

Based on experience and expert knowledge

Largely affects cost

Task Description

Goal: improve the expert's prediction model with machine learning

Data scarcity: collecting data is expensive and time-consuming

We unite 4 coking industries to collaboratively work on this task

Challenges

Local datasets have different distributions

Industries have different expert (knowledge) models

Privacy of local datasets and knowledge models has to be preserved

Multiclass Classification

Setting

\red{\mathcal{D}} = \left\{(\blue{x^{(i)}}, \green{y^{(i)}})\right\}_{i=1}^N \subset \blue{\mathcal{X}} \times \green{\{1,\dots,k\}} \sim \purple{\mathcal{F}}

\red{\mathcal{D}}:\enspace\text{training set}

\blue{x^{(i)}}:\enspace\text{data instance (features of raw coal)}

\blue{\mathcal{X}}:\enspace\text{feature space}

\green{y^{(i)}}:\enspace\text{label (quality of the final coke)}

\green{\{1,\dots,k\}}:\enspace\text{label space}

\purple{\mathcal{F}}:\enspace\text{data distribution}

Task

\text{Find}\enspace f: \mathcal{X} \to \{1,\dots,k\} \enspace\text{such that}\enspace \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{F}}[\mathbb{I}(f(x) \neq y)] \enspace\text{is small}

\enspace\text{or}\enspace \mathop{\mathbb{E}}_{(x,y)\sim\mathcal{D}}[\mathbb{I}(f(x) \neq y)] \enspace\text{is small.}

Knowledge-based Models

Prediction-type Knowledge Model (P-KM)

g_p: \mathcal{X} \to \{1,\dots,k\} \enspace\text{such that}\enspace g_p(x) \enspace\text{is a point estimate for}\enspace y \enspace\forall (x,y) \sim \mathcal{F}

E.g., mechanistic prediction models, such as a differential equation that describes the underlying physical process.

\red{(k = 3,\enspace g_p(x) = 2)}

Range-type Knowledge Model (R-KM)

g_r: \mathcal{X} \to 2^{\{1,\dots,k\}} \enspace\text{such that}\enspace y \in g_r(x) \enspace\forall (x,y) \sim \mathcal{F}

E.g., can be derived from the causality of the input-output relationship.

\red{(k = 3,\enspace g_r(x) = \{2, 3\})}
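As a minimal sketch of the two model types, assume a toy one-dimensional feature and made-up thresholds (neither comes from the case study). A P-KM returns a single label, while an R-KM returns a set of admissible labels; the functions `g_p` and `g_r` below are hypothetical stand-ins chosen to reproduce the k = 3 examples.

```python
from typing import Set

K = 3  # number of classes, as in the k = 3 examples


def g_p(x: float) -> int:
    """Hypothetical P-KM: a point estimate of the label from a toy threshold rule."""
    if x < 1.0:
        return 1
    if x < 2.0:
        return 2
    return 3


def g_r(x: float) -> Set[int]:
    """Hypothetical R-KM: a set of labels assumed to contain the true label."""
    if x < 0.5:
        return {1, 2}
    if x < 1.0:
        return {1, 2, 3}  # no labels can be ruled out in this region
    return {2, 3}


x = 1.5
point_estimate = g_p(x)  # a single class in {1, ..., K}
label_range = g_r(x)     # a subset of {1, ..., K}
```

For `x = 1.5` the sketch gives the point estimate 2 and the range {2, 3}.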

Federated Learning with Knowledge-based Models

M clients and a central server.

Each client m has

\text{training set}\enspace \mathcal{D}^m \sim \purple{\mathcal{F}^m}

\purple{\mathcal{F}^m}:\enspace\text{conditional data distribution depending on}\enspace \purple{\mathcal{F}}

\text{P-KM}\enspace g_p^m \enspace\text{for distribution}\enspace \mathcal{F}^m

\text{R-KM}\enspace g_r^m \enspace\text{for distribution}\enspace \mathcal{F}^m

g_p^m \enspace\text{agrees with}\enspace g_r^m \enspace \red{(g_p^m(x) \in g_r^m(x) \enspace \forall x)}

Task Description

Each client m obtains a personalized predictive model

f^m: \mathcal{X} \to \Delta^k \coloneqq \{p \in \mathbb{R}^k \mid p_i \geq 0,\ \sum_i p_i = 1\}

f^m \enspace\text{utilizes the local P-KM}\enspace g^m_p \enspace\text{with a controllable trust level}

f^m \enspace\text{agrees with the local R-KM}\enspace g^m_r \enspace\text{i.e.}\enspace \{i \mid f^m(x)_i > 0\} \subseteq g^m_r(x)\enspace \forall x \in \mathcal{X}

Design a federated learning framework such that

clients can benefit from others' datasets and knowledge

the privacy of local datasets and local KMs is protected

Direct Formulation Invokes Infinitely Many Constraints

Simple setting

\text{Single client with}\enspace \mathcal{X} = \mathbb{R}^d

\text{Logistic model}\enspace f(\theta; x) = \blue{\mathop{softmax}}(\theta^T x) \enspace\text{with}\enspace \theta\in\mathbb{R}^{d\times k}

\blue{ \mathop{softmax}(z \in \mathbb{R}^k)_i = \dfrac{\exp(z_i)}{\sum_j \exp(z_j)} }

\red{(f(\theta; \cdot): \mathbb{R}^d \to \Delta^k)}

\text{Loss function}\enspace \mathcal{L}(\theta) = \frac{1}{|\mathcal{D}|} \sum\limits_{(x,y) \in \mathcal{D}} \blue{\mathop{crossentropy}}(f(\theta; x), y)

\blue{ \mathop{crossentropy}(p\in\Delta^k, y\in\{1,\dots,k\}) = -\log(p_y) }

Challenging optimization problem

\min\limits_{\theta \in \mathbb{R}^{d\times k}} \enspace \mathcal{L}(\theta) \enspace\text{s.t.}\enspace \{i \mid f(\theta; x)_i > 0\} \subseteq g_r(x) \enspace \forall x \in \mathbb{R}^d

\red{(\text{infinitely many constraints})}

Architecture Design

The server provides a general deep learning model

f(\red{\theta}; \cdot): \mathcal{X} \to \mathbb{R}^k

\red{\theta}:\enspace\text{learnable model parameters}

Function transformation

\mathcal{T}_{\lambda, g_p, g_r}(f)(x) = (1-\lambda)\mathop{softmax}(f(x) + z_r) + \lambda z_p

where

(z_r)_i = \begin{cases} 0 &\text{if}\enspace i \in g_r(x)\\ -\infty &\text{otherwise} \end{cases} \enspace\text{and}\enspace (z_p)_i = \begin{cases} 1 &\text{if}\enspace i = g_p(x)\\ 0 &\text{otherwise} \end{cases}

Personalized model

f^m(\red{\theta}; \cdot) \coloneqq \mathcal{T}_{\lambda^m, g_p^m, g_r^m}(f(\red{\theta}; \cdot))
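The function transformation can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: classes are numbered 1 to k, the base model `f` is a hypothetical callable that returns k raw scores, and the names `masked_softmax` and `transform` are made up for this sketch and are not taken from the FedMech.jl package.

```python
import math
from typing import Any, Callable, List, Sequence, Set


def masked_softmax(logits: Sequence[float], allowed: Set[int]) -> List[float]:
    """softmax(logits + z_r) with classes numbered 1..k.

    Classes outside `allowed` receive logit -inf, i.e. probability exactly 0.
    """
    if not allowed:
        raise ValueError("g_r(x) must contain at least one class")
    # Subtract the largest allowed logit for numerical stability.
    top = max(logits[i - 1] for i in allowed)
    exps = [math.exp(z - top) if i in allowed else 0.0
            for i, z in enumerate(logits, start=1)]
    total = sum(exps)
    return [e / total for e in exps]


def transform(f: Callable[[Any], Sequence[float]],
              g_p: Callable[[Any], int],
              g_r: Callable[[Any], Set[int]],
              lam: float) -> Callable[[Any], List[float]]:
    """Return T_{lam, g_p, g_r}(f): x -> (1 - lam) * softmax(f(x) + z_r) + lam * z_p."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    def f_m(x: Any) -> List[float]:
        probs = masked_softmax(f(x), g_r(x))
        point = g_p(x)  # z_p is the one-hot vector of this class
        return [(1.0 - lam) * p + (lam if i == point else 0.0)
                for i, p in enumerate(probs, start=1)]

    return f_m


# Toy usage with k = 3: a constant base model and constant knowledge models.
f_m = transform(lambda x: [2.0, 0.5, -1.0],
                g_p=lambda x: 2,
                g_r=lambda x: {2, 3},
                lam=0.3)
probs = f_m(0.0)
```

In the toy usage, class 1 has the largest raw score but lies outside the range {2, 3}, so its probability is exactly zero, and class 2 receives at least the trust level 0.3.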

Properties of Personalized Model

f^m(\theta; \cdot) \enspace\text{is a valid predictive model, i.e.,}\enspace f^m(\theta; x) \in \Delta^k \enspace \forall x \in \mathcal{X}

\lambda^m \in [0,1] \enspace\text{controls the trust level of the local P-KM}\enspace g^m_p

f^m(\theta; x)_{g^m_p(x)} \geq \lambda^m \enspace \forall x \in \mathcal{X}

\text{If}\enspace \lambda^m > 0.5 \enspace\text{then}\enspace f^m \enspace\text{coincides with}\enspace g^m_p

\argmax_i f^m(\theta; x)_i = g_p^m(x) \enspace \forall x \in \mathcal{X}

f^m(\theta; \cdot) \enspace\text{agrees with the local R-KM}\enspace g^m_r

\{i \mid f^m(\theta; x)_i > 0\} \subseteq g^m_r(x)\enspace \forall x \in \mathcal{X}
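These properties follow from the definition of the transformation. The short derivation below is a sketch rather than a quotation from the paper. It writes p for the masked softmax output and assumes that the range is nonempty and that the P-KM agrees with the R-KM.

```latex
% Notation: p = softmax(f(\theta; x) + z_r), so p \in \Delta^k and p_i = 0 for i \notin g_r^m(x).
% Assumptions: g_r^m(x) is nonempty and g_p^m(x) \in g_r^m(x).
\begin{align*}
f^m(\theta; x) &= (1-\lambda^m)\, p + \lambda^m z_p, \\
\sum_i f^m(\theta; x)_i &= (1-\lambda^m) \sum_i p_i + \lambda^m \sum_i (z_p)_i = (1-\lambda^m) + \lambda^m = 1, \\
f^m(\theta; x)_{g_p^m(x)} &= (1-\lambda^m)\, p_{g_p^m(x)} + \lambda^m \geq \lambda^m, \\
f^m(\theta; x)_i &= (1-\lambda^m)\, p_i \leq 1-\lambda^m < \lambda^m
  \quad \text{for } i \neq g_p^m(x) \text{ when } \lambda^m > 0.5, \\
f^m(\theta; x)_i > 0 &\implies p_i > 0 \text{ or } i = g_p^m(x) \implies i \in g_r^m(x).
\end{align*}
```

The second line, together with nonnegativity of every entry, gives validity. The third and fourth lines give the trust-level bound and the arg max claim, and the last line gives agreement with the R-KM.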

Optimization

Optimization problem

\min\limits_{\theta}\enspace\red{\mathcal{L}}(\theta) \coloneqq \sum\limits_{m=1}^M \red{\mathcal{L}^m}(\theta) \enspace\text{with}\enspace \mathcal{L}^m(\theta) = \frac{1}{|\mathcal{D}^m|}\sum\limits_{(x,y) \in \mathcal{D}^m} \mathop{crossentropy}(f^m(\theta; x), y)

\red{\mathcal{L}}:\enspace\text{global loss}

\red{\mathcal{L}^m}:\enspace\text{local loss}

FedAvg [McMahan et al.'17]

\text{server selects a subset of clients}\enspace S \subseteq \{1,\dots,M\} \enspace\text{and sends them the latest model}\enspace \theta

\text{each selected client}\enspace m \enspace\text{sets}\enspace \theta^m \leftarrow \theta \enspace\text{and locally updates}\enspace \theta^m \leftarrow \theta^m - \eta\nabla\mathcal{L}^m(\theta^m) \enspace (\text{d times})

\text{server updates global model by aggregating local models}\enspace \theta \leftarrow \frac{1}{|S|}\sum\limits_{m\in S} \theta^m
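The FedAvg loop can be sketched as follows. To keep the example self-contained, each local loss is replaced by a hypothetical quadratic stand-in 0.5 * ||theta - c_m||^2 rather than the cross-entropy of the personalized model, and every client participates in every round. The names `local_update`, `fedavg_round`, and `make_quadratic_grad` are assumptions made for this sketch.

```python
from typing import Callable, List, Sequence

Vector = List[float]
GradFn = Callable[[Vector], Vector]


def local_update(theta: Vector, grad: GradFn, eta: float, local_steps: int) -> Vector:
    """Run `local_steps` gradient steps on one client, starting from the server model."""
    theta_m = list(theta)  # theta^m <- theta
    for _ in range(local_steps):
        g = grad(theta_m)  # gradient of the local loss at the current local model
        theta_m = [t - eta * gi for t, gi in zip(theta_m, g)]
    return theta_m


def fedavg_round(theta: Vector, client_grads: Sequence[GradFn],
                 selected: Sequence[int], eta: float, local_steps: int) -> Vector:
    """One round: selected clients update locally, then the server averages their models."""
    local_models = [local_update(theta, client_grads[m], eta, local_steps)
                    for m in selected]
    n = len(local_models)
    return [sum(coords) / n for coords in zip(*local_models)]


def make_quadratic_grad(center: Vector) -> GradFn:
    """Gradient of the stand-in local loss 0.5 * ||theta - center||^2."""
    return lambda theta: [t - c for t, c in zip(theta, center)]


# Four hypothetical clients whose stand-in losses have different minimizers.
centers = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
client_grads = [make_quadratic_grad(c) for c in centers]

theta = [5.0, -3.0]
for _ in range(50):
    selected = list(range(len(client_grads)))  # full participation for determinism
    theta = fedavg_round(theta, client_grads, selected, eta=0.1, local_steps=5)
```

With these stand-in losses the global minimizer is the mean of the centers, (1, 1), and the iterates approach it. Sampling a strict subset of clients in each round would instead make the iterates fluctuate around that point.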

Numerical Results (Case-study)

Test accuracy

\text{TA} = \frac{1}{|\mathcal{D}^m_{\text{test}}|} \sum\limits_{(x,y) \in \mathcal{D}^m_{\text{test}}} \mathbb{I}(\argmax_i f^m(\theta; x)_i = y)

Percentage of violation

\text{POV} = \frac{1}{|\mathcal{D}^m_{\text{test}}|} \sum\limits_{(x,y) \in \mathcal{D}^m_{\text{test}}} \mathbb{I}(\argmax_i f^m(\theta; x)_i \notin g_r^m(x))

Numerical Results (Public Datasets)

Datasets

\textbf{Covtype:}\enspace \text{number of classes}\enspace k = 7,\enspace \text{feature space}\enspace \mathcal{X} = \mathbb{R}^{54}

\textbf{FMNIST:}\enspace \text{number of classes}\enspace k = 10,\enspace \text{feature space}\enspace \mathcal{X} = \mathbb{R}^{28\times 28}

Data distribution

Each client only gets samples from some classes.

P-KM

We train a deep model with a subset of features.

R-KM

We construct a hashmap to guarantee the true label is within the range.

Numerical Results (Public Datasets)

Open-source package: https://github.com/ZhenanFanUBC/FedMech.jl

Paper: Fan, Zhenan, Zirui Zhou, Jian Pei, Michael P. Friedlander, Jiajie Hu, Chengliang Li, and Yong Zhang. "Knowledge-Injected Federated Learning." arXiv preprint arXiv:2208.07530 (2022).

Knowledge-Injected Federated Learning

Thank you! Questions?

Zhenan Fan
