An M-to-N Actor Scheduler for Cython+

TL;DR

In the actor model, actors receive messages that represent requests for asynchronous computation, and at some later point execute them one at a time.

This is a way to structure concurrent programs; the question is then how to run such a program so that it takes full advantage of the available computational resources through parallel execution.

  1. The computational resources in question are system threads and hardware cores
  2. We apply the M-to-N model of scheduling to actors
  3. We present a work-stealing scheduler inspired by Go's
  4. We will need coroutines to support promises
  5. In the future we wish to introduce asynchronous I/O for better performance

1. Hardware Cores and OS Threads

OS threads mapped to hardware cores represent the computation resources.

Modern computers have turned to multicore processor architectures to boost performance because improvements in single-core speed are drying up. That is why writing concurrent programs is increasingly important.

Operating systems abstract these hardware resources as processes and OS threads. Each process corresponds to a separate running program, so within a single program, OS threads are the relevant abstraction of computation resources. OS threads share resources such as memory.

Operating systems include a scheduler to handle allocating running time on the processor cores to their processes and OS threads. These schedulers commonly use preemption to interrupt running threads and slot waiting threads in their place, so as to give all threads a turn; swapping one thread out for another this way is called context-switching.

Most modern processor cores support hardware multithreading: executing multiple threads on the same core simultaneously at the hardware level. Intel's terminology for this is hyper-threading. So the number of threads a CPU can truly run simultaneously depends on the number of cores and on their hyper-threading capabilities.
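
As a concrete starting point, here is how a program can query that number in Python (Python stands in for Cython+ in the sketches throughout this post):

```python
import os

# Number of hardware threads the OS reports: cores x threads per core.
# os.cpu_count() can return None on exotic platforms, hence the fallback.
n_threads = os.cpu_count() or 1
print(f"this machine can truly run {n_threads} threads simultaneously")
```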

2. Scheduling M Actors onto N OS Threads

We wrote our own M-to-N actor scheduler for performance reasons.

A first idea for running our actors would be to give each actor its own OS thread. That thread would execute the received messages one after another in a loop, and otherwise wait for new messages. The OS scheduler would then handle all the work of running the actor threads.
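
As a rough Python sketch (purely illustrative, not the Cython+ API), the 1-to-1 design looks like this:

```python
import queue
import threading
import time

class ThreadPerActor:
    """1-to-1 model: each actor owns a dedicated OS thread."""

    def __init__(self):
        self.mailbox = queue.Queue()
        # One OS thread per actor: creating it involves a system call.
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, message):
        self.mailbox.put(message)

    def _run(self):
        while True:
            # Suspend the whole OS thread until a message arrives,
            # then process messages one at a time.
            self.receive(self.mailbox.get())

    def receive(self, message):
        print(f"processing {message!r}")

actor = ThreadPerActor()   # spawns an OS thread
actor.send("hello")
time.sleep(0.1)            # give the daemon thread time to run
```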

Such a design would follow the 1-to-1 scheduling model: one actor for one OS thread. It's easy and intuitive, but has substantial performance drawbacks:

  • Expensive thread creation: creating a thread involves system calls to the OS kernel
  • Expensive preemption: preemption is outside of our control and involves context-switching
  • Expensive synchronisation: synchronisation involves system calls and context-switching

Instead, we decided to write our own M-to-N scheduler to run any number M of actors over N OS threads fixed for the duration of the program. This provides an abstraction over OS threads analogous to the abstraction over hardware cores provided by the OS scheduler.

By choosing N to be the number of threads the hardware can truly run simultaneously, we can in theory avoid the overhead of preemption by the OS scheduler: each thread will be able to run without interruption. In practice some preemption might be unavoidable: other programs could take up part of the workload, and the OS itself might need some computation resources as well.

In any case, this lets the program scale up or down to adapt to the available computation resources, simply by choosing the number N at the beginning of execution. And since the OS threads need to be created only once at the beginning, creating actors is now very cheap.
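
A minimal Python sketch of this M-to-N structure (again standing in for Cython+; a single shared run queue is used for simplicity, and details such as parking idle workers are omitted):

```python
import os
import queue
import threading
import time

class Actor:
    """Creating an actor is now just allocating an object: no system call."""
    def __init__(self, name):
        self.name = name
        self.mailbox = queue.Queue()

    def process_one(self):
        message = self.mailbox.get_nowait()
        print(f"{self.name}: {message!r} on {threading.current_thread().name}")

# N worker OS threads, created once and fixed for the whole run.
N = os.cpu_count() or 1
run_queue = queue.Queue()  # actors that have pending messages

def worker():
    while True:
        actor = run_queue.get()        # block until some actor has work
        actor.process_one()            # run one message, then...
        if not actor.mailbox.empty():  # ...requeue the actor if more remain
            run_queue.put(actor)       # (simplified: this check is racy)

for i in range(N):
    threading.Thread(target=worker, name=f"worker-{i}", daemon=True).start()

# M actors over N threads: spawning many actors stays cheap.
actors = [Actor(f"actor-{i}") for i in range(100)]
for a in actors:
    a.mailbox.put("ping")
    run_queue.put(a)
time.sleep(0.5)
```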

3. A Cooperative Scheduler with Work-Stealing to Avoid Idleness

Work-stealing is an efficient way to keep the work balanced across the scheduler's worker threads.

Our scheduler design takes inspiration from Go's M-to-N scheduler.

One particularity in our case is that messages to the same actor need to be processed one at a time, so we're not simply dispatching messages onto available OS threads. Instead we're dispatching actors onto OS threads, and each OS thread can process one or more of an actor's messages before switching to another actor.

Our scheduler is cooperative instead of preemptive: actors are never forcibly interrupted to switch to another actor. A worker thread can only decide to switch between actors when the running actor returns control to the scheduler.

Our proof-of-concept does not (yet) support suspending execution at arbitrary points and resuming later from the same point. This means actors can only return control to the scheduler by returning from processing a message, and the worker threads can only switch actors in between messages or when the actor runs out of pending messages.

Our scheduler implements a work-stealing scheduling strategy. Actors are initially assigned to a randomly chosen worker thread the first time they receive a message, and worker threads that run out of work can steal waiting actors from threads that are busy. When an actor runs out of pending messages, the thread drops it so that it becomes unassigned, until a new message gets it assigned to a worker thread again.
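
The sketch below (still illustrative Python, not the actual Cython+ scheduler) extends the earlier one with per-worker run queues, random initial assignment, and stealing; the checks around assignment are deliberately simplified and would need to be made race-free in a real implementation:

```python
import collections
import os
import queue
import random
import threading
import time

class Actor:
    def __init__(self):
        self.mailbox = queue.Queue()
        self.assigned = False  # is the actor currently on some worker's queue?

class Worker:
    def __init__(self, index):
        self.index = index
        self.actors = collections.deque()  # local run queue of assigned actors
        self.lock = threading.Lock()       # protects the deque from thieves

    def run(self):
        while True:
            actor = self.next_actor()
            if actor is None:
                time.sleep(0.001)          # a real scheduler parks idle workers
                continue
            message = actor.mailbox.get_nowait()
            print(f"{message!r} handled by {threading.current_thread().name}")
            with self.lock:
                if not actor.mailbox.empty():
                    self.actors.append(actor)  # keep the actor assigned here
                else:
                    actor.assigned = False     # drop it: unassigned again

    def next_actor(self):
        with self.lock:
            if self.actors:
                return self.actors.popleft()
        return self.steal()

    def steal(self):
        victim = random.choice(workers)    # random peer (possibly itself,
        with victim.lock:                  # which is harmless here)
            if victim.actors:
                return victim.actors.pop() # take from the victim's queue
        return None

def send(actor, message):
    actor.mailbox.put(message)
    if not actor.assigned:                 # first pending message: assign the
        actor.assigned = True              # actor to a random worker (racy
        worker = random.choice(workers)    # check, simplified for the sketch)
        with worker.lock:
            worker.actors.append(actor)

workers = [Worker(i) for i in range(os.cpu_count() or 1)]
for w in workers:
    threading.Thread(target=w.run, name=f"worker-{w.index}", daemon=True).start()

actors = [Actor() for _ in range(8)]
for i, a in enumerate(actors):
    send(a, f"ping {i}")
time.sleep(0.5)
```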

This strategy lets the scheduler balance the workload between the OS threads at execution time. That's ideal for task parallelism with heterogeneous tasks and tasks spawned by other tasks at execution time - in our case, heterogeneous actors that can spawn more actors. Unlike with data parallelism, this means the work cannot be known and partitioned in advance. This applies to programs doing tree traversals, for example, or to event-driven programs like web servers.

Work-stealing keeps contention low by avoiding a central point of synchronisation to determine how to balance the load. Instead only peer-to-peer interactions between worker threads are required.

4. The Need for Coroutines

Coroutines - functions that can suspend and resume - are required to support promises.

A "function" that can suspend execution and resume later is called a coroutine.

This ability is required to support blocking operations such as waiting until a promise is fulfilled to unwrap the result of an asynchronous computation. In fact, coroutines and promises are often interwoven in languages that support them natively.
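
To make the interplay concrete, here is a minimal generator-based sketch in Python; the Promise class and the run_coroutine driver are hypothetical stand-ins, not Cython+ APIs:

```python
class Promise:
    """A minimal single-threaded promise; a real one needs synchronisation."""
    def __init__(self):
        self._callbacks, self._result, self._done = [], None, False

    def fulfill(self, result):
        self._result, self._done = result, True
        for callback in self._callbacks:
            callback(result)

    def on_fulfilled(self, callback):
        if self._done:
            callback(self._result)
        else:
            self._callbacks.append(callback)

def run_coroutine(gen, value=None):
    """Drive a generator-based coroutine: each `yield promise` suspends it
    until that promise is fulfilled, then resumes it with the result."""
    try:
        promise = gen.send(value)
    except StopIteration:
        return
    promise.on_fulfilled(lambda result: run_coroutine(gen, result))

def greet(promise):
    name = yield promise      # suspend here; no OS thread is blocked
    print(f"hello, {name}")

p = Promise()
run_coroutine(greet(p))
print("the coroutine is suspended, the thread is free for other work")
p.fulfill("world")            # resumes the coroutine: prints "hello, world"
```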

Without support for coroutines, the only way for actors run by our scheduler to perform a blocking operation is to suspend the whole underlying OS thread. This is not ideal if the blocking operation stretches on, because in the meantime the OS thread is not doing any other work. It gets downright problematic when resuming the OS thread is contingent on some other work getting done, such as when waiting for a promise: if all the worker threads are suspended, there is no OS thread left to do the work that would fulfill the promises and resume the waiting threads, and the whole system ends up deadlocked.

One solution would be to spawn new worker OS threads to replace suspended worker threads, and phase worker threads out again as suspended threads resume. But this is not ideal because creating and destroying OS threads and letting the OS scheduler do the preemption and context-switching is expensive.

Instead, we're planning on using userland context-switching, without involving the OS scheduler. This is much cheaper than letting the OS scheduler handle context-switching because it avoids switching to kernel mode and back to user mode.
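
One existing implementation of such userland switching for CPython is the third-party greenlet package; a tiny demonstration of suspending and resuming without ever entering the kernel:

```python
import greenlet  # third-party package: pip install greenlet

def task():
    print("task: started")
    main.switch()   # suspend: a plain userland jump, no kernel mode involved
    print("task: resumed exactly where it left off")

main = greenlet.getcurrent()
child = greenlet.greenlet(task)

child.switch()      # run task() until it suspends itself
print("main: task is suspended")
child.switch()      # resume task() right after its switch() call
```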

5. The Need for Asynchronous I/O

Asynchronous I/O is still needed to avoid tying up the scheduler's worker threads during I/O operations.

We need asynchronous I/O for the same reason we need coroutines: to avoid suspending an OS thread waiting for I/O to complete.

Instead of blocking until I/O completes, I/O operations such as read or write could merely return a promise. The caller would then be free to continue immediately with other work, or to suspend execution until the promise is fulfilled, giving other actors the opportunity to run on this OS thread in the meantime.

With promises and the coroutine ability to suspend/resume execution, asynchronous I/O can feel exactly like synchronous I/O when the caller immediately waits for the promise to be fulfilled.

One way this could be implemented is with an OS thread dedicated to running an asynchronous I/O loop. When a requested I/O operation completes, the loop wakes up and fulfills the promise associated with the request. Operating systems provide mechanisms for this, such as the Linux kernel's epoll and io_uring interfaces.
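
For illustration, Python's standard selectors module (backed by epoll on Linux) can play the part of that dedicated I/O thread; here a Future stands in for the promise, and async_read is a hypothetical name, not a real API:

```python
import selectors
import socket
import threading
import time
from concurrent.futures import Future

selector = selectors.DefaultSelector()  # epoll on Linux, kqueue on BSD/macOS

def async_read(sock, nbytes=4096):
    """Return a promise (a Future here), fulfilled when `sock` has data."""
    promise = Future()
    selector.register(sock, selectors.EVENT_READ, (promise, nbytes))
    return promise

def io_loop():
    """Dedicated I/O thread: wake up on readiness and fulfill promises."""
    while True:
        # Short timeout so sockets registered after select() are picked up;
        # a real loop would use a wakeup fd instead of polling.
        for key, _ in selector.select(timeout=0.1):
            promise, nbytes = key.data
            selector.unregister(key.fileobj)
            promise.set_result(key.fileobj.recv(nbytes))

threading.Thread(target=io_loop, daemon=True).start()

# The caller gets a promise immediately and is free to do other work.
a, b = socket.socketpair()
promise = async_read(a)
promise.add_done_callback(lambda f: print(f"received: {f.result()!r}"))
b.send(b"hello")
time.sleep(0.3)
```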

Asynchronous I/O scales extremely well and is one of the best ways to tackle the C10K problem of handling ten thousand or more simultaneous connections.