Cython+ extends Cython with high-performance, statically typed GIL-free types (must-watch video).

Cython+ is useful for Python programmers who need the performance of C or C++, the concurrency of Go and the safety of Rust for their production systems, but are not willing to migrate all their Python code to C++ (as at Google) or to Go/Rust (as at Dropbox). Cython+ offers existing Cython developers (e.g. scikit-learn) new language primitives that further extend the usability of Cython beyond numerical computation or wrapping C/C++ libraries.

Compared to C++ or Rust, Cython+ offers a simpler and faster alternative that teams of Python programmers can adopt with a gentler learning curve. Compared to Go, Cython+ additionally offers simpler integration with existing Python codebases.

The current state and roadmap of the language are summarised here.

 

Feature                                            Status            When
GIL-free objects                                   implemented
Interoperability of GIL-free objects with Python   implemented
GIL-free lists, dicts, sets                        implemented
GIL-free string literals                           urgent            Q4 2021
GIL-free strings                                   urgent            Q4 2021
GIL-free exceptions                                soon              Q1 2022
Actor model for concurrency                        implemented
Ownership type system for safe concurrency         proof-of-concept
Expressing complex task dependencies               urgent            Q4 2021
Awaiting channels and futures                      soon              Q1 2022
An Improved Object Model                           not urgent
Standalone execution without the CPython runtime   proof-of-concept
Standard library (basics)                          urgent            Q4 2021
Continuation-stealing scheduler                    not urgent
Memory allocation                                  not urgent
Syntax for builtin container literals              not urgent
Exposing templated types to Python                 not urgent

 

The most urgent next steps are:

  • finalise the semantics of the language, especially concurrency (e.g. synchronisation) and strings
  • provide a basic standard library, especially file I/O, filesystem operations and sockets

GIL-free String Literals

Cython+ needs native GIL-free strings with a natural syntax for string literals: currently we merely use the C++ std::string type, and Cython tends to interpret string literals as Python strings. This leads to a lot of friction for even simple string operations, with Cython inferring string literals as Python objects and complaining that Python objects are not allowed inside nogil sections.

So we need a dedicated string syntax to unambiguously denote GIL-free strings. The current idea is:

  • Introduce s-strings with an s prefix to signify GIL-free strings: s"this is a GIL-free string"
  • Add a compiler directive to specify how unprefixed string literals should be inferred: the default would preserve the current behavior of inferring Python strings first, but the directive would allow making GIL-free strings the priority inference.

The syntax for string literals is a priority because without it we need to bend over backwards in order to manipulate strings.

GIL-free Strings

Moreover, we want to implement GIL-free strings with Cython+'s object model instead of relying directly on C++ strings. This will:

  • let strings be aliased without copying the string,
  • give us full control over the string API, e.g. to let us mirror Python's string methods or enforce immutability,
  • let GIL-free strings be manipulated from Python without copying them into a Python string,
  • give us more control over memory management, e.g. to do I/O without unnecessary copies or initialisation.

In particular, we would like to be able to allocate an uninitialised char * buffer to be filled by I/O functions without having to copy it to a GIL-free string afterward. The basic_string::resize_and_overwrite proposal for C++23 explains this very well. Until it is implemented, the C++ standard string library sadly provides no way to achieve this.
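For illustration, here is a minimal sketch of the kind of code this would enable, assuming a C++23 standard library and a POSIX read(); the function name read_some is hypothetical:

    // Sketch: read from a POSIX file descriptor straight into an uninitialised
    // std::string buffer using C++23 resize_and_overwrite (no zero-fill, no copy).
    #include <cstddef>
    #include <string>
    #include <unistd.h>  // ::read

    std::string read_some(int fd, std::size_t max_bytes) {
        std::string s;
        // Allocates max_bytes of uninitialised storage, lets the callback fill
        // part of it, then shrinks the string to the size the callback returns.
        s.resize_and_overwrite(max_bytes, [fd](char* buf, std::size_t n) {
            ssize_t got = ::read(fd, buf, n);  // I/O writes directly into the buffer
            return got > 0 ? static_cast<std::size_t>(got) : std::size_t{0};
        });
        return s;
    }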

In the meantime, we can either:

  • use std::string as a backend anyway, which will let us use existing string libraries like the excellent fmt,
  • implement our own strings, e.g. using std::string_view introduced in C++17, giving us full control over memory management.

Note that individual implementations of the C++ standard library may already expose internal methods that would allow allocating uninitialised strings. For example, Abseil has a single-file, header-only way to use libc++'s __resize_default_init method if available.

Standard Library

Cython+ needs a standard library beyond the builtin types. The most urgent needs are an I/O library, a filesystem library and a socket library.

This will be achieved by wrapping existing C or C++ libraries in Cython+.

The goal is to provide a basic library that lets developers do useful things in Cython+ without having to wrap C/C++ libraries from scratch each time.

A side goal is to establish guidelines on how to wrap C and C++ libraries for Cython+, to eventually allow users to contribute to the standard library themselves.

Candidate C/C++ libraries include:

For further functionalities, another library under consideration is Abseil.

This is urgent because without it, it is difficult to write even very simple programs.

GIL-free Exceptions

Cython+ needs exceptions that can be raised and caught without acquiring the GIL. Currently we only have Python exceptions (which require the GIL) and C-style error codes.

We propose to use C++ exceptions because they already exist and work well, and because C++ exceptions incur no overhead when nothing is raised. The alternative would be to manually propagate exceptions to the caller at every function call, which means checking at every call site whether an exception was raised.

However, using C++ exceptions requires changing the way reference-counting instructions are generated for both Python and Cython+ objects: otherwise, when an exception is raised, the references held in the current scope and the calling scopes are never decrefed, leading to memory leaks. The solution is to encapsulate reference counting in a C++ object, so that when the object is destroyed the reference is decrefed. This is called Scope-Bound Resource Management (or Resource Acquisition Is Initialisation, RAII), and it works because C++ destroys all the local objects in the scopes through which an exception propagates.
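As a minimal sketch of the idea in plain C++ (the names object, incref, decref and ref are illustrative, not the actual Cython+ code generation):

    // Stand-in for a reference-counted GIL-free object (illustrative only).
    struct object { int refcount = 1; };
    void incref(object* o) noexcept { ++o->refcount; }
    void decref(object* o) noexcept { if (--o->refcount == 0) delete o; }

    // Scope guard owning one reference: the destructor decrefs, and C++ runs
    // destructors during stack unwinding, so a raised exception cannot leak
    // the reference.
    class ref {
        object* obj_;
    public:
        explicit ref(object* o) noexcept : obj_(o) {}          // takes ownership
        ref(const ref& other) noexcept : obj_(other.obj_) { incref(obj_); }
        ref& operator=(const ref&) = delete;                   // kept minimal for the sketch
        ~ref() { decref(obj_); }                               // also runs while unwinding
        object* get() const noexcept { return obj_; }
    };

    void might_throw() { throw 42; }

    int main() {
        try {
            ref local(new object{});   // the scope owns the reference
            might_throw();             // unwinding destroys `local`, which decrefs
        } catch (...) {
            // no leak: the object was freed when `local` was destroyed
        }
    }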

GIL-free exception handling will use the same syntax as Python exceptions:

  • raise: whether a C++ or a Python exception is raised will depend on the type of the object being raised
  • try / except: whether C++ or Python exceptions are caught will depend on the type of exceptions specified in the except clause

Safe Concurrency with Reference Capabilities

Cython+ introduces a type system of reference capabilities for safe concurrency inspired by Pony, Project Verona and Rust.

This type system revolves around the notions of ownership, immutability and thread-locality, represented at compile time by reference capabilities.

Currently an incomplete proof-of-concept has been implemented. It still lacks the notion of immutability and needs a much improved syntax.

Task Adapters to Express Dependencies

Cython+ provides a concurrency model based on the actor model. Tasks are scheduled on actors by pushing them into each actor's individual task queue, and each actor handles its tasks asynchronously by executing them in the order they were inserted.

This makes it easy to order tasks on an individual actor: the tasks that are scheduled first are executed first.

We can also relatively easily order tasks with simple multi-actor dependencies, when we need a task Y on actor B to be executed after a specific task X on actor A: this can be achieved by writing task X so that at the end it schedules task Y on B.

But we can already see that this makes task dependencies implicit and scattered throughout the code itself. And for more complex dependencies it becomes complicated fast: how do we order tasks that depend on multiple but independent tasks on separate actors? Sometimes the dependencies themselves are dynamic: picture a task Y that must be executed only when there are no scheduled-but-still-unexecuted tasks of type X, possibly all scheduled on different actors. Such a situation requires complex signaling every time a task of type X is scheduled and every time one finishes executing, in order to determine when task Y must be scheduled.

What we need is a way to encapsulate the task ordering logic in a dedicated object. This object is then passed at the site where a task is scheduled, to adapt the scheduling and behavior of the task.

Currently we have a proof-of-concept with limited functionality: all the adapter object can do is signal, based on an arbitrary condition, that a task needs to be rescheduled for later execution instead of being executed immediately. The task is then pushed back to the end of the queue. But this is inefficient: if there are many such tasks in a queue, the actor will spend most of its time pushing tasks back to the end of the queue instead of doing useful work.

Our plan is to provide a new interface that allows the user to customise:

  • How the task is scheduled on the actor, e.g. where it is inserted in the queue (instead of at the end by default)
  • What happens when the task is run, e.g. run or reschedule the task conditionally, signal another actor when the task begins and ends

This is powerful enough to support arbitrarily complex task dependencies in an efficient and non-intrusive way.
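To make the interface described above more concrete, here is a hedged sketch of what a task adapter could look like (the names TaskAdapter, schedule, run and RunWhen are illustrative, not the actual Cython+ API):

    #include <deque>
    #include <functional>
    #include <utility>

    using Task  = std::function<void()>;
    using Queue = std::deque<Task>;

    // An adapter object passed at the scheduling site to customise both how
    // the task enters the actor's queue and what happens when it is dequeued.
    struct TaskAdapter {
        virtual ~TaskAdapter() = default;

        // Where the task is inserted (default behavior: at the end).
        virtual void schedule(Queue& queue, Task task) {
            queue.push_back(std::move(task));
        }

        // What happens when the actor picks the task up: run it, reschedule
        // it conditionally, or signal another actor before and after.
        virtual void run(Queue& queue, Task task) {
            (void)queue;
            task();
        }
    };

    // Example: an adapter that runs the task only when a condition holds,
    // and otherwise puts it back at the *front* of the queue.
    struct RunWhen : TaskAdapter {
        std::function<bool()> ready;
        explicit RunWhen(std::function<bool()> r) : ready(std::move(r)) {}
        void run(Queue& queue, Task task) override {
            if (ready()) task();
            else queue.push_front(std::move(task));
        }
    };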

Awaiting

Currently, Cython+ implements a purely asynchronous actor model: there is no primitive to wait for an asynchronous task to complete or return a result. Instead, an asynchronous task can only pass data forward to another asynchronous task. For example, actor A can schedule a task on actor B, and that task can forward the result of its computation to actor A by scheduling a task on it.

This actor model is powered by a work-stealing M:N scheduler that dynamically distributes a variable number M of actors over a fixed number N of OS threads that map directly to the available processor cores. This has much less overhead than creating an OS thread per actor and letting the system scheduler map those threads onto the available cores.

We wish to add primitives to support waiting for asynchronous computations to complete. This is important because it will make interfacing synchronous blocking code with asynchronous code much, much easier. It also enables composing asynchronous functions and propagating results or exceptions back to the calling function.

This raises interesting implementation challenges: if an actor suspends execution waiting for another actor to asynchronously complete a computation, we cannot afford to block the underlying thread on which the actor was running, because that would prevent all the other actors from running on this thread / core. If all threads used by the actor scheduler are suspended waiting for tasks to complete, there are no more threads available to run these tasks.

A general solution is to use context-switching to save the stack of the waiting task and then switch to running other tasks, exactly as if the first task had actually completed. When the awaited task completes, we can context-switch back onto the suspended task and resume execution. Since there is no native support in C++ for context-switching, we would rely on external libraries like boost::context. This comes with some caveats, mainly that we should never context-switch during stack unwinding, while an active exception has been thrown but not yet caught. Fortunately, this case can be detected using std::uncaught_exceptions.

As for the primitive itself, we intend to implement channel types similar to Go Channels and Python Queues, on which execution can be suspended while waiting for an asynchronous task to push a result. In particular, channels can easily be used as Futures/Promises. This of course requires support from the scheduler. If useful, we can implement some variations, like dedicated Futures/Promises types.

Additionally, we're considering introducing a dedicated await syntax to wait on channels so that it is clear just by looking at the code where it might suspend.

The higher-level library boost::fibers (which uses boost::context internally) seems like an excellent starting point to implement a work-stealing actor scheduler with support for awaiting.
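As a small illustration of the primitive (not the Cython+ scheduler itself), a boost::fibers channel used as a one-shot future between two fibers:

    #include <boost/fiber/all.hpp>
    #include <iostream>

    int main() {
        // Unbuffered channel: push() suspends until someone pops, and pop()
        // suspends until someone pushes; only the fiber is suspended, while
        // the underlying thread keeps running other fibers.
        boost::fibers::unbuffered_channel<int> result;

        boost::fibers::fiber producer([&result] {
            result.push(6 * 7);        // "completes" the future
        });

        int value = 0;
        result.pop(value);             // awaits the asynchronous result
        std::cout << value << "\n";    // prints 42

        producer.join();
    }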

Note that with the additional constraint that we can only wait on tasks from the top frame of an asynchronous task, rather than at any depth in the call stack, as in stackless coroutines or async/await, we can avoid the need for context-switching.

An Improved Object Model

The current model for GIL-free objects is based on runtime type polymorphism with virtual methods and subclasses that exactly respect the interface of their base class.

This allows subclasses to act as drop-in replacements for their base classes. For example, a list declared as containing a given type A can actually contain a heterogeneous mix of subclasses of A.

This is useful because it allows, for example, library developers to define an interface that users can customise later.

However, it comes with some drawbacks and limitations:

  • virtual methods add an indirection and prevent inlining (performance overhead)
  • unrelated types with compatible interfaces cannot be interchanged (no duck-typing)
  • subclasses must respect the interface of their base class (same method signatures, same attributes)
  • to support the "diamond" scenario in multiple inheritance we need to add another indirection with virtual inheritance

In many cases, all that is actually required is compile-time polymorphism, e.g. when we want functions that work for all types as long as they have the needed interface. For this purpose, C++ templates are enough, and actually provide more functionality than runtime subtyping: any type that fits the function in the sense of duck-typing will just work (instead of only the subtypes of the interface imposed by the function).

But what if the type doesn't match? C++ compilers tend to generate incomprehensible error messages when templates fail to compile, so we want to avoid that. Fortunately, there is an easy solution: we can make the compiler fall back on alternative code that will never fail to compile but will instead raise an exception at runtime, exactly like when Python raises AttributeError or TypeError.
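A hedged sketch of this fallback in plain C++ (assuming C++20; the names quack, Duck and Brick are illustrative):

    #include <iostream>
    #include <stdexcept>

    // If the type provides the duck-typed method, call it; otherwise compile
    // a branch that raises at runtime, mirroring Python's AttributeError.
    template <typename T>
    void quack(T& duck) {
        if constexpr (requires { duck.quack(); }) {
            duck.quack();                          // duck-typing: any matching type works
        } else {
            throw std::runtime_error("object has no attribute 'quack'");
        }
    }

    struct Duck  { void quack() { std::cout << "quack\n"; } };
    struct Brick { };

    int main() {
        Duck d;
        Brick b;
        quack(d);                                  // prints "quack"
        try { quack(b); }                          // compiles fine, raises at runtime
        catch (const std::exception& e) { std::cout << e.what() << "\n"; }
    }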

If we drop the requirement that subtypes are able to act as drop-in replacements for their base classes, we can design an object model where:

  • subclasses can write methods with different signatures than in their base classes
  • subclasses can completely redefine their attributes compared to their base classes
  • methods can have untyped arguments using C++ templates under the hood
  • method calls can be inlined, and even when they are not inlined they are still faster because they avoid an indirection
  • mixin classes can have methods that refer to attributes that only exist in the class that will inherit the mixin
  • the method resolution order is the same as in Python without any overhead, even with mixins and multiple inheritance
  • multiple inheritance behaves the same as in Python without any overhead

Essentially, inheritance becomes purely about reusing code instead of runtime polymorphism, which makes it much better at facilitating code reuse. It would feel much more like Python, and not unlike Rust traits.

But it seems that by doing this we lose the ability to have runtime polymorphism, e.g.:

  • containers of heterogeneous types
  • separately compiled code thanks to strictly typed interfaces

We can actually still provide runtime polymorphism through two mechanisms other than inheritance: interfaces and tagged unions.

Interface types will just consist of strictly typed method and attribute declarations. They can be implemented in C++ using type erasure. Like in Go, a type will implicitly match any interface with compatible methods and attributes. In particular, the method signatures and attribute types don't need to be identical, as long as they can be coerced to match. At the C++ level, a proxy type that strictly matches the interface will be automatically generated to forward access to the actual type.
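A hedged sketch of such a proxy, written by hand in plain C++ (Greeter, greet and the wrapped types are illustrative names, not generated Cython+ code):

    #include <initializer_list>
    #include <iostream>
    #include <memory>
    #include <string>
    #include <utility>

    // The "interface" type: any object with a compatible greet() method can be
    // wrapped, without inheriting from a common base class.
    class Greeter {
        struct concept_t {                          // hidden abstract core
            virtual ~concept_t() = default;
            virtual std::string greet() const = 0;
        };
        template <typename T>
        struct model_t final : concept_t {          // proxy forwarding to the actual type
            T value;
            explicit model_t(T v) : value(std::move(v)) {}
            std::string greet() const override { return value.greet(); }
        };
        std::shared_ptr<const concept_t> impl_;
    public:
        template <typename T>
        Greeter(T value) : impl_(std::make_shared<model_t<T>>(std::move(value))) {}
        std::string greet() const { return impl_->greet(); }
    };

    // Two unrelated types that both happen to match the interface.
    struct English { std::string greet() const { return "hello"; } };
    struct French  { std::string greet() const { return "bonjour"; } };

    int main() {
        for (const Greeter& g : {Greeter{English{}}, Greeter{French{}}})
            std::cout << g.greet() << "\n";         // runtime dispatch via the proxy
    }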

This approach brings multiple advantages:

  • types will be able to match an interface even if their methods don't have the exact same signature
  • runtime polymorphism (via interfaces) will be uncoupled from code reuse (via inheritance)
  • the overhead of runtime dispatch will only need to be paid when it is actually needed
  • types will be able to match interfaces without prior knowledge of each other thanks to implicit matching
  • matching a type to an interface will feel like duck-typing
  • primitive types such as integers will be able to match interfaces

Tagged unions can be implemented using std::variant and provide runtime dispatch with std::visit. They can provide an alternative way to manipulate heterogeneous types without the need to explicitly declare the interface, but they do this by only matching a predetermined set of types.
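A short illustration of the tagged-union mechanism in plain C++ (the set of value types is arbitrary):

    #include <iostream>
    #include <string>
    #include <variant>

    // A variant holds one value out of a predetermined set of types, and
    // std::visit dispatches on the actual type at runtime, without declaring
    // any interface or common base class.
    using Value = std::variant<int, double, std::string>;

    void print(const Value& v) {
        std::visit([](const auto& x) { std::cout << x << "\n"; }, v);
    }

    int main() {
        print(Value{42});
        print(Value{3.14});
        print(Value{std::string{"hello"}});
    }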

Standalone Execution without the CPython Runtime

If a Cython+ program uses only GIL-free objects without using Python at all, it doesn't actually need the CPython runtime. But Cython+ has no native support yet for expressing that concept and checking during compilation that no CPython functions are ever called, so the program will still include the CPython headers and generate code to allow importing it from Python as an extension module. It's just that the program can define a main function from which no code path actually calls the CPython runtime.

Proof-of-concept experiments have shown that such a Cython+ program can be compiled and run without linking against the CPython runtime. The first experiment relied on simply telling the C++ compiler to ignore linking errors, which was unsafe, brittle and highly dependent on the compiler.

Since then we've had success generating and linking against a fake CPython runtime where every function merely terminates the program with an error message. This strategy should be fully portable and will provide useful information if the program turns out to call CPython after all (instead of crashing with a segmentation fault due to calling an unresolved symbol).
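As an illustration of this strategy, here is a hand-written stub for a single CPython entry point; a real fake runtime would generate one such stub per referenced symbol:

    #include <cstdio>
    #include <cstdlib>

    extern "C" {

    typedef struct _object PyObject;   // opaque, matching CPython's forward declaration

    // Stub with the same signature as the real PyObject_Call: if any code path
    // reaches the CPython runtime after all, the program stops with a clear
    // message instead of crashing on an unresolved symbol.
    PyObject* PyObject_Call(PyObject*, PyObject*, PyObject*) {
        std::fprintf(stderr, "fatal: PyObject_Call was called, but this program "
                             "was built without the CPython runtime\n");
        std::abort();
    }

    }  // extern "C"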

A longer term goal is to add support to Cython+ to enforce GIL-freedom across the whole program and avoid generating code that depends on the CPython headers.

Continuation-Stealing

With the current implementation of the actor scheduler, when a task is scheduled on an actor, it is pushed into the actor's queue and the function that spawned the task immediately continues execution. With a continuation-stealing scheduler, we would instead try to context-switch onto the task to execute it immediately, and defer the continuation of the function that spawned the task until a core becomes available.

Continuation-stealing is a good way to avoid spawning an excessive number of tasks (e.g. in loops), by instead waiting (e.g. at each loop iteration) for a core to become available to run the continuation. Otherwise we can end up with unbounded task queues that grow faster than the tasks can be executed. In other words, it's a way to provide backpressure and manage the growth of task queues.

Within an actor system, there is a complication: if the actor is already busy with other tasks, it is impossible to execute the task immediately. In that situation we have two choices:

  1. Push the task into a queue and continue executing the function that spawned the task.
  2. Suspend the current thread of execution until the actor becomes idle and the task can be executed.

In fact, which behavior to adopt could be a property of the actor's queue:

  • When inserting a task into an empty queue, the queue can choose whether to execute the task immediately and defer the caller
  • When inserting a task into a non-empty queue, the queue can choose whether to just store the task or to suspend the caller instead

This could lead to multiple kinds of queues:

  • continuation-stealing queues: when the queue is empty, inserting a task actually context-switches onto the task and defers the continuation
  • unbuffered queues: when the queue is not empty, inserting a task suspends the caller until the queue becomes empty
  • bounded queues: the queue has a maximum capacity and inserting more tasks suspends the caller until the queue is not full
  • unbounded queues: tasks can always be inserted without suspending the caller, no matter how many are already queued

This would allow each actor to choose what kind of backpressure it needs.

Memory Allocation

Cython+ currently uses C++'s default operators new and delete to allocate and free memory for its GIL-free objects. How these operators are actually implemented is a property of the particular implementation of the standard library used by the program. Usually they use malloc, which itself comes from the implementation of the C library.

In multi-threaded programs, the memory allocator can be a performance bottleneck due to thread contention: since the memory space is shared, it must be able to safely deal with concurrent requests for allocation or release. Implementations have different strategies to reduce thread contention, some of which are very efficient. In particular, the common case where an object is allocated and freed from the same thread can be optimised.

C++ supports user-provided global and class-specific replacements for these operators, meaning we can fully customise the memory allocation strategy.
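For example, allocation for a single class can be redirected to a dedicated allocator; a minimal sketch assuming mimalloc is available and linked in (Message is an illustrative type):

    #include <cstddef>
    #include <new>
    #include <mimalloc.h>

    struct Message {
        int payload[16];

        // Class-specific replacements: every `new Message` / `delete msg`
        // now goes through mimalloc instead of the default global allocator.
        static void* operator new(std::size_t size) {
            if (void* p = mi_malloc(size)) return p;
            throw std::bad_alloc{};
        }
        static void operator delete(void* p) noexcept { mi_free(p); }
    };

    int main() {
        Message* m = new Message{};   // allocated with mi_malloc
        delete m;                     // freed with mi_free
    }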

We could start by comparing various existing malloc implementations geared towards multi-threading, like:

A more sophisticated approach would be to design a custom memory allocator that leverages our actor system and our type system, e.g. with per-actor heaps, like in:

Syntax for Built-in Container Literals

In Python, containers like lists, dicts and tuples can be created using a dedicated literal syntax: square brackets for lists, curly brackets for dicts, parentheses for tuples.

At some point, it would be nice to have a natural literal syntax for the GIL-free equivalents. The syntax will probably need to be different because the Python syntax is already reserved for the Python containers.

This task is not urgent because it does not impact the semantics of the language. Syntactic sugar can be added later.

Exposing Templated Types to Python

When the GIL is acquired, Cython+'s GIL-free objects can be seen as generic Python objects and manipulated directly from the Python interpreter. Under the hood they are actually full-fledged Python objects, with a Python reference count and a pointer to a custom Python extension type object. These are only accessible when the GIL is acquired; without the GIL, the objects rely on static typing to let the compiler resolve method calls and the like.

For each GIL-free type, a custom Python extension type is generated to expose its methods and attributes via wrapper functions and Python descriptor objects. Any method or attribute whose types the compiler doesn't know how to convert to Python objects is ignored. The custom Python extension type also defines how to create a new instance directly from Python.

However, templated GIL-free types like containers that use C++ templates under the hood are currently not exposed to Python. That's because without specifying a contained type a list is not yet a type but a template that will be compiled into many totally unrelated types. A list of integers is not the same type as a list of strings. The same goes for any other template type.

One thing we could do is generate separate custom extension types for selected individual instantiations of a template, so a list of integers and a list of strings wouldn't have the same Python type. Since we cannot know in advance all the types into which a template will be compiled, we could by default only expose the most common cases and provide a syntax to let the user select other types.

But there is a better alternative: we can use type erasure to expose a common interface with separate implementations that will forward calls to the actual underlying type. The interface would only accept and return generic Python objects. Each implementation would be responsible for converting objects from and to Python types and emitting exceptions if the passed Python objects do not have a compatible type. We would then only need to generate one custom extension type to expose the interface type to Python.

This approach would fit very well with the proposed improved object model.