Neural Actors:
The intersection between
machine learning, hardware,
and the actor model
@Smerity
My history is a mix of
linguistics
and
computing
Leveraging more compute
Actors
ML
Will Wright's "Dynamics for Designers"
Rich interactions for systems we don't yet know / can't explicitly specify we're developing
An actor can:
- Send a finite number of messages to other actors
- Actors have a mailbox with an address and bounded / unbounded capacity
- Create a finite number of new actors
The last point minimizes the need for orchestration systems
The actor model
The actor model helps solve:
- Concurrency (no shared state, many cores)
- Scalability (spawn new actors as needed, potentially on nodes across the network)
- Reliability (actors supervising other actors)
- Flexibility: "Actor creation plus addresses in messages means variable topology"
The actor model
Erlang / Elixir:
- Actors as a language primitive
- Erlang/OTP used for reliable telecoms
Rust's "fearless concurrency":
- Mutable with single owner
- Immutable with many references
The actor model in Software 1.0
Actor model enabling Amazon's two pizza rule
WhatsApp: "35 engineers and reached more than 450 million users" (pre Facebook acquisition)
Discord: "scaling to over a 100 million messages per day with only 4 backend engineers in 2017 and serving 250+ million users with less than 50 engineers in 2020"
Team and software scaling
Actors and the web at scale
Lone engineer at CommonCrawl
(2.5 petabytes and 35 billion webpages)
Extensive use of MapReduce
(simplified actor model if you squint)
The web as an actor ecosystem
(concurrency, scalability, reliability, flexibility, interoperability, ...)
Frank McSherry's COST
Graph processing (single threaded laptop) w/ Rust
Actors and ML
ML frameworks act as message passing++
(fwd + bwd are sync / async msgs)
Actors are explicit operations learning and performing implicit tasks via obj functions
ML components can be seen as actors but:
- High parallelism, minimal concurrency (SPMD)
- Inability to spawn (except limited by the above)
The limitations of hardware
SPMD means "one (hammer) program"
The result? All problems made equivalent nails
Multi-tenancy would provide different primitives
(at least CPUs are ~good at time sharing)
At present any "spawning" is manually specified
Remember: "Actor creation plus addresses in messages means variable topology"
What does MPMD look like atm?
- NVIDIA: High end cards at best 7 MPMD (MIG)
(max theoretical is 108 as 108 SMs)
...
The best you can do is run many nodes with many cards and send messages about
This gives you the horrors of both worlds: neural networks and container orchestration!
Mixture of Experts (MoE)
Tenstorrent 🤔
- Many small independent cores (RISC + SIMD)
- Cores communicate via network packets
- Cores agnostic to same node / cross network
- Conditional / variable computation
- High parallelism, high concurrency
The dream: XPUs
Small + many enables a future of:
"This programs requires 8 XPU cores"
Why? I desperately want to be able to write a program featuring ML that doesn't rely on them having internet access or a $1k card (with the right drivers installed ...)
- Programs that don't rely on foreign API
- Doesn't require a local $1k card (with right drivers)
- Edge models don't need as much conversion
Neural actors
A neural actor can:
- Send a finite number of messages to other actors
(Explicit addr or implicit addr via attention)
- Actors have a mailbox with an address and bounded / unbounded capacity
- Create a finite number of new neural actors
Neural actor possibilities
Scale up/down network and compute
- Proxy actors for messaging (filter out at source, predict missing packets, ...)
- LM actors spawned between components for shared compressed language / comms
- TEMPEST actors: "ephemeral arbitrator between AIs w/o knowledge exposed"
- Spare capacity for expansion / distillation
- Treat msgs over network like RNN BPTT
Example: Expert Choice MoE
Ancient history: n-grams
Past decades: search engine's inverted index
(past: "Actor" appears on pages A, B, C, ...)
Recent:
word2vec("Actor") => 1024 dim f32 vector
"Actor" + context => 1024 dim vector
The word connecting to its own meaning
Know a word by the company it keeps
Data actors
An actor expanding / better understanding a specific piece of data (implicit program)
- Naive: inverted index (i.e. search)
- Implicit: language models / embeddings
- Explicit: data actors continuously shifting about an actor ecosystem
"Actor" node has high entropy .: mitosis spawns: "Actor (programming)", "Actor (arts)"
Traveling Linguist Problem
Data actors shifting about an actor ecosystem
=>
A data ecosystem groups, sorts, and removes redundancies within itself such that you have the minimal surprise learning a language
Will Wright's "Dynamics for Designers"
Rich interactions for systems we don't yet know / can't explicitly specify we're developing
Actors Everywhere
By smerity
Actors Everywhere
- 911