Mixtral of Experts
Mistral AI
8 Jan 2024
Introduction
-
Mixtral 8x7B is a sparse mixture-of-experts (SMoE) model.
-
The feedforward block picks from a set of 8 distinct groups of parameters.
-
At every layer, for every token, a router network selects two of these groups to process the token and combines their outputs additively.
-
Mixtral is pretrained with multilingual data using a context size of 32k tokens.
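A compact, illustrative recap of this setup; the dataclass and field names below are hypothetical, not the official Hugging Face config.
# Illustrative summary of the Mixtral 8x7B routing setup described above.
# Names are hypothetical; values come from the bullets in this section.
from dataclasses import dataclass

@dataclass
class MoeArchSketch:
    n_experts: int = 8         # distinct groups of FFN parameters per layer
    top_k: int = 2             # experts selected by the router for each token
    context_len: int = 32_768  # pretraining context size (32k tokens)

print(MoeArchSketch())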
Sparse Mixture of Experts
-
The output of the expert layer is given by
∑_{i=0}^{n-1} G(x)_i ⋅ E_i(x)
-
G(x)_i denotes the n-dimensional output of the gating network for the i-th expert.
-
E_i(x) is the output of the i-th expert network.
-
This sparsity avoids computing the outputs of experts whose gating values are zero.
G(x) := Softmax(TopK(x ⋅ W_g))
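A minimal sketch of this gating function for a single token, where TopK keeps the top-K logits and sets the rest to −∞ before the softmax (toy sizes; W_g here is just a random matrix, not trained weights).
# Sketch of G(x) = Softmax(TopK(x ⋅ W_g)) for one token (toy sizes, random weights).
import torch

n_experts, hidden_dim, k = 8, 16, 2
W_g = torch.randn(hidden_dim, n_experts)   # gating weights (random, for illustration)
x = torch.randn(hidden_dim)                # one token's hidden state

logits = x @ W_g                           # (n_experts,)
topk_vals, topk_idx = torch.topk(logits, k)
masked = torch.full_like(logits, float("-inf"))
masked[topk_idx] = topk_vals               # keep the top-k logits, -inf elsewhere
gates = torch.softmax(masked, dim=-1)      # exactly k non-zero gating values
print(gates)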
MoE Layers
-
The MoE layer is applied independently per token and replaces the feed-forward (FFN) sub-block of the transformer block.
-
Mixtral uses the SwiGLU architecture as the expert function E_i(x) and sets K = 2.
-
Each token is routed to two SwiGLU sub-blocks with different sets of weights.
-
The output y for an input token x is computed as:
y = ∑_{i=0}^{n-1} Softmax(Top2(x ⋅ W_g))_i ⋅ SwiGLU_i(x).
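Each expert is itself a SwiGLU feed-forward block. A minimal sketch of one expert is shown below; the class name is hypothetical and the layer sizes are illustrative, with w1/w2/w3 following the common Mistral/Llama naming convention.
# Sketch of a single SwiGLU expert: E_i(x) = w2(SiLU(w1(x)) * w3(x)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    def __init__(self, hidden_dim: int = 16, ffn_dim: int = 64):
        super().__init__()
        self.w1 = nn.Linear(hidden_dim, ffn_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(hidden_dim, ffn_dim, bias=False)  # up projection
        self.w2 = nn.Linear(ffn_dim, hidden_dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

print(SwiGLUExpert()(torch.randn(5, 16)).shape)  # torch.Size([5, 16])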
Mixture of Experts Layer

Mixture of Experts Architecture

MixtralSparseMoeBlock (HF)
def forward(self, hidden_states: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # e.g. batch_size=4, seq_len=8, hidden_dim=16
    batch_size, seq_len, hidden_dim = hidden_states.shape
    # Every token in every batch may be routed to a different set of experts,
    # so the tokens are flattened before routing.
    # router_logits: (batch * sequence_length, n_experts) => (32, 4)
    hidden_states = hidden_states.view(-1, hidden_dim)
    router_logits = self.gate(hidden_states)
    routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
    # Use `torch.topk` to select the K experts with the highest weights
    routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
    # Renormalize so the selected experts' weights sum to 1 for each token
    routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
    routing_weights = routing_weights.to(hidden_states.dtype)
    # Pre-allocate the output tensor
    final_hidden_states = torch.zeros(
        (batch_size * seq_len, hidden_dim),
        dtype=hidden_states.dtype, device=hidden_states.device,
    )
    # Build the expert mask with one-hot encoding
    # expert_mask: (32, 2, 4) => (4, 2, 32), n_tokens=32, selected_experts=2, n_experts=4
    expert_mask = torch.nn.functional.one_hot(selected_experts, num_classes=self.n_experts)
    expert_mask = expert_mask.permute(2, 1, 0)
    # Visit each expert and process only the tokens assigned to it
    for expert_idx in range(self.n_experts):
        expert_layer = self.experts[expert_idx]
        # top_x holds the positions of the tokens this expert processes
        # idx is the rank of the current expert among each token's selections
        idx, top_x = torch.where(expert_mask[expert_idx])
        # Skip this expert if it has no tokens to process
        if top_x.shape[0] == 0:
            continue
        # Gather the tokens to process according to top_x
        curr_state = hidden_states[None, top_x].reshape(-1, hidden_dim)
        # Compute the expert output, weighted by its routing weight
        curr_hidden_states = expert_layer(curr_state) * routing_weights[top_x, idx, None]
        # Scatter-add the weighted hidden states back into the output tensor
        final_hidden_states.index_add_(0, top_x, curr_hidden_states)
    final_hidden_states = final_hidden_states.reshape(batch_size, seq_len, hidden_dim)
    return final_hidden_states, router_logits
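The mask-and-gather step above is the least obvious part. A tiny standalone example (toy shapes, not from the paper) shows what expert_mask, idx, and top_x look like.
# Toy illustration of the masking trick: 3 tokens, 4 experts, top_k=2.
import torch

selected_experts = torch.tensor([[0, 2], [2, 1], [0, 3]])  # (n_tokens=3, top_k=2)
expert_mask = torch.nn.functional.one_hot(selected_experts, num_classes=4)
expert_mask = expert_mask.permute(2, 1, 0)                 # (n_experts=4, top_k=2, n_tokens=3)

# Expert 2 is token 1's first choice and token 0's second choice.
print(expert_mask[2])            # tensor([[0, 1, 0],
                                 #         [1, 0, 0]])
idx, top_x = torch.where(expert_mask[2])
print(idx, top_x)                # tensor([0, 1]) tensor([1, 0])
# routing_weights[top_x, idx] then picks expert 2's normalized weight for tokens 1 and 0,
# and final_hidden_states.index_add_(0, top_x, ...) scatters the weighted outputs back.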
Sparse & Active Parameters
-
If one increases n while keeping K fixed, one can increase the model's parameter count while keeping its computational cost effectively constant.
-
The model's total parameter count (sparse) grows with n.
-
The number of parameters used for processing a token (active) grows with K; see the back-of-envelope sketch below.
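A rough back-of-envelope sketch of this distinction, counting only the expert FFN weights (attention, embedding, and norm parameters are ignored) with Mixtral 8x7B's published hyperparameters.
# Back-of-envelope: sparse vs. active parameter counts, expert FFN weights only.
dim, ffn_dim, n_layers = 4096, 14336, 32
n_experts, top_k = 8, 2

params_per_expert = 3 * dim * ffn_dim            # SwiGLU: w1, w2, w3
sparse_ffn = n_layers * n_experts * params_per_expert
active_ffn = n_layers * top_k * params_per_expert

print(f"sparse FFN params: {sparse_ffn / 1e9:.1f}B")  # ~45.1B of the ~47B total
print(f"active FFN params: {active_ffn / 1e9:.1f}B")  # ~11.3B of the ~13B active per token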
MoE Parallelism
-
The experts can be distributed across multiple GPUs through Expert Parallelism (EP).
-
Each token is routed to the GPU that hosts its assigned experts for processing.
-
EP introduces load-balancing challenges: routing must avoid overloading individual GPUs and creating computational bottlenecks (see the toy sketch below).
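A highly simplified, single-process sketch of the idea; the expert-to-GPU mapping and token assignments below are made up for illustration, and real EP dispatches tokens with all-to-all communication across ranks.
# Conceptual sketch of Expert Parallelism: experts are sharded across GPUs,
# and each token is dispatched to the device holding its assigned expert.
from collections import defaultdict

n_experts, n_gpus = 8, 4
expert_to_gpu = {e: e % n_gpus for e in range(n_experts)}  # 2 experts per GPU

# token -> top-1 expert chosen by the router (toy values)
assignments = {0: 3, 1: 0, 2: 3, 3: 5, 4: 3, 5: 1}

per_gpu_tokens = defaultdict(list)
for token, expert in assignments.items():
    per_gpu_tokens[expert_to_gpu[expert]].append(token)

# Uneven routing leads to uneven load, which is the balancing challenge above.
for gpu, tokens in sorted(per_gpu_tokens.items()):
    print(f"GPU {gpu}: {len(tokens)} token(s) {tokens}")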
Parameters




Common Benchmarks

Comparison of Mixtral with Llama

Comparison of Mixtral with Llama 2 70B and GPT-3.5

Multilingual Benchmarks

Passkey Retrieval Task

Long Range Performance

Instruction Fine-Tuning

Routing Analysis

Percentage of Expert Assignment Repetitions

First Expert Choice

Conclusion
-
Mixtral 8x7B Instruct outperforms Claude-2.1, Gemini Pro, and GPT-3.5 Turbo on human evaluation benchmarks.
-
Mixtral only uses 13B active parameters per token.
-
It outperforms the previous best model (Llama 2 70B), which uses 70B parameters per token.
-
Publicly available under the Apache 2.0 license.
Mixtral
By Penut Chen (陳威廷)