Expert Emergence in Mixture-of-Experts Architectures

sumitdotml
Independent
sumit@sumit.ml

Abstract

This is just a dummy abstract; real content will be added once the project is complete.

Keywords: transformers, attention mechanisms, mixture-of-experts, expert independence

Everything here is dummy for now, FYI! Real content will be added once the project is complete.

Introduction

Transformer architectures have become the dominant paradigm for natural language processing tasks (Elhage et al., 2021).

Testing citation here (Vaswani et al., 2017).

Background

The Dummy Circuit

Some LaTeX dummy:

$$W_{OV} = W_V W_O$$

LaTeX in a normal paragraph can be added like this:

When $W_{OV} \approx I$ (the identity matrix), the attention head primarily copies information from attended positions without significant transformation.
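
As a minimal sketch of this composition (the dimension and the randomly initialized W_V, W_O below are purely illustrative assumptions, not weights from any real model):

import torch

d_head = 64  # illustrative dimension, an assumption for this sketch
W_V = torch.randn(d_head, d_head) / d_head**0.5  # stand-in value projection
W_O = torch.randn(d_head, d_head) / d_head**0.5  # stand-in output projection

# Compose the OV circuit: W_OV = W_V W_O
W_OV = W_V @ W_O

# If W_OV were close to the identity, the head would mostly copy
# the attended representation through unchanged.
print(torch.allclose(W_OV, torch.eye(d_head), atol=0.1))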

Measuring Dummy Behavior

We can quantify how close $W_{OV}$ is to the identity using the dummy norm:

$$d_D(W_{OV}, I) = \|W_{OV} - I\|_D = \left(\sum_{i,j} (w_{ij} - \delta_{ij})^2\right)^{1/2}$$

where $\delta_{ij}$ is the Kronecker delta (the entries of the identity matrix $I$).
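
A minimal sketch of this distance, assuming the dummy norm is the Frobenius norm of $W_{OV} - I$ as the formula suggests (the function name dummy_norm is made up for illustration):

import torch

def dummy_norm(W_OV: torch.Tensor) -> torch.Tensor:
    # d_D(W_OV, I): square root of the summed squared entries of W_OV - I
    eye = torch.eye(W_OV.shape[0], dtype=W_OV.dtype)
    return torch.linalg.matrix_norm(W_OV - eye, ord="fro")

# A near-identity matrix yields a small distance.
W_OV = torch.eye(8) + 0.01 * torch.randn(8, 8)
print(dummy_norm(W_OV))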

Inline LaTeX dummy: the same distance can also be written inline as $d_D(W_{OV}, I) = \|W_{OV} - I\|_D = \left(\sum_{i,j} (w_{ij} - \delta_{ij})^2\right)^{1/2}$.

Methodology

Our analysis proceeds in three steps:

  1. dummy step 1
  2. dummy step 2
  3. dummy step 3

Model Selection

We analyze the following architectures:

Model     Parameters   Layers   Heads
Dummy 1   124M         12       12
Dummy 2   355M         24       16
Dummy 3   774M         36       20

Results

Results to be added in future versions.

Code sample:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()  # modern zero-argument super()
        self.fc1 = nn.Linear(10, 20) # sample comment to extend the line - more dummy text to see how it looks
        self.fc2 = nn.Linear(20, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))  # nonlinearity so the two linear layers don't collapse into one
        x = self.fc2(x)
        return x
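
A quick usage sketch for the module above (batch size 4 is arbitrary; the input width of 10 matches fc1's in_features):

model = MyModel()
x = torch.randn(4, 10)  # batch of 4 inputs, each of size 10
out = model(x)
print(out.shape)  # torch.Size([4, 10])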

Conclusion

Conclusion to be added in future versions.

References


Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30.