2022-09-22

Traversing nested lists with coroutines, Rosetta Code style

by Kunshan Wang

Table of contents [show/hide]

I’ll try to use coroutines to traverse nested lists, Rosetta Code style. That means I’ll do it in many different programming languages and libraries, including Ruby, Lua, Python (including greenlets), JavaScript, Rust, C#, etc. This task shows the difference between symmetric vs asymmetric coroutines, and stackful vs stackless coroutines.

Note that this post alone may not be enough to teach you how to use coroutines in all those languages.

I’ll also provide basic information about coroutines, swap-stack, async/await, etc. in the appendices.

The task

Input:

a nested list of numbers, such as [1, [[2, 3], [4, 5]], [6, 7, 8]]

Output:

recursively output all numbers in the list. At each level, visit all numbers in one element before visiting any subsequent elements.

When given the list above, the output should be 1, 2, 3, 4, 5, 6, 7 and 8, in that order.

Requirement:

Use coroutine(s) to enumerate a nested list, and yield elements to the calling coroutine one at a time.

I will try to do this task using as many programming languages as possible, Rosetta Code style, to compare their coroutine syntax and API. At the time of writing (2022), different programming languages still differ greatly w.r.t. the design of coroutines.

The code

Ruby fibers (stackful, both asymmetric and symmetric)

Fibers are “primitives for implementing light weight cooperative concurrency in Ruby”.

Ruby fibers are stackful. According to the documentation:

As opposed to other stackless light weight concurrency models, each fiber comes with a stack. This enables the fiber to be paused from deeply nested function calls within the fiber block.

Ruby fibers can operate in both asymmetric and symmetric mode. I’ll demonstrate the task in both modes below.

Asymmetric

The Fiber#resume instance method resumes a fiber, and a subsequent call to the Fiber.yield class method jumps back to the resumer.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
def traverse(x)
  if x.is_a? Array
    x.each do |elem|
      traverse(elem)      # recursive call
    end
  else
    Fiber.yield x         # can yield within recursive calls
  end
end

def fiber_traverse(x)
  Fiber.new do
    traverse(x)
    raise StopIteration
  end
end

DATA = [1, [[2, 3], [4, 5]], [6, 7, 8]]

fiber = fiber_traverse DATA

loop do   # Break if StopIteration is raised.
  value = fiber.resume
  puts value
end

The Enumerator class can automatically transform block-based visiting functions into fiber-based coroutine. It uses fiber only when necessary. It uses fiber when used as external iterators (calling e.next explicitly), but still uses call-back for internal iteration (e.each { |v| ... }).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
def traverse(x, &block)
  if x.is_a? Array
    x.each do |elem|
      traverse(elem, &block)
    end
  else
    block.yield(x)  # This is just a usual method call, not a coroutine yield.
  end
end

def enum_traverse(x)
  Enumerator.new do |yielder|   # The yielder encapsulates how to yield.
    traverse(x) do |value|
      yielder.yield(value)      # This may or may not use coroutine yield.
    end
  end
end

DATA = [1, [[2, 3], [4, 5]], [6, 7, 8]]

# We can do internal iteration
enum_traverse(DATA).each do |value|   # use call-back
  puts value
end

# and external iteration too.
e = enum_traverse(DATA)
loop do           # Break if StopIteration is raised.
  value = e.next  # This will use fiber.
  puts value
end

Symmetric

The Fiber#transfer method can switch to any fiber, but always needs an explicit fiber to switch to. We can pass the current fiber to the new fiber when we create it, so it can remember which fiber to transfer back to.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
require "fiber" 

def traverse(x, parent)
  if x.is_a? Array
    x.each do |elem|
      traverse(elem, parent)  # always remember the parent
    end
  else
    parent.transfer x
  end
end

def fiber_traverse(x)
  current = Fiber.current     # get the current fiber
  Fiber.new do
    traverse(x, current)      # pass the fiber as parent
    raise StopIteration
  end
end

DATA = [1, [[2, 3], [4, 5]], [6, 7, 8]]

fiber = fiber_traverse DATA
loop do   # Break if StopIteration is raised.
  value = fiber.transfer
  puts value
end

Lua threads (stackful, asymmetric)

Lua “threads” are stackful coroutines. Lua has a stackless interpreter, therefore it can easily implement stackful coroutines (Why? See appendix).

Lua provides asymmetric coroutines (with limitations) for the sake of simplicity and portability.

Simplicity: According to Coroutines in Lua, asymmetric coroutines may be easier to understand.

On the other hand, asymmetric coroutines truly behave like routines, in the sense that control is always transferred back to their callers. Since even novice programmers are familiar with the concept of a routine, control sequencing with asymmetric coroutines seems much simpler to manage and understand, besides allowing the development of more structured programs
Portability: Supporting symmetric coroutines (or even proper stackful asymmetric coroutines) will require C to have coroutine facilities, such as swap-stack, which is not always available. According to Coroutines in Lua:

Lua and C code can freely call each other; therefore, an application can create a chain of nested function calls wherein the languages are interleaved. Implementing a symmetric facility in this scenario imposes the preservation of C state when a Lua coroutine is suspended. This preservation is only possible if a coroutine facility is also provided for C; but a portable implementation of coroutines for C cannot be written.

Lua also added a limitation: a coroutine cannot yield while there are C function frames on its stack. Otherwise, Lua would require a swap-stack mechanism for C, making Lua less portable.

In Lua, the coroutine.resume function continues the execution of a coroutine, and coroutine.yield jumps back to the calling coroutine.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
function traverse(x)
  if type(x) == "table" then
    for i, v in ipairs(x) do
      traverse(v)         -- recursive call
    end
  else
    coroutine.yield(x)    -- can yield within recursive calls
  end
end

function coroutine_traverse(x)
  return coroutine.create(function()
    traverse(x)
    return nil
  end)
end

local list = {1, {{2, 3}, {4, 5}}, {6, 7, 8}}

local coro = coroutine_traverse(list)

while true do
  local _, value = coroutine.resume(coro)
  if value == nil then
    break   -- terminated
  end
  print(value)
end

The coroutine.wrap function can wrap the coroutine into an iterator function suitable for the generic for statement.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
function traverse(x)
  if type(x) == "table" then
    for i, v in ipairs(x) do
      traverse(v)
    end
  else
    coroutine.yield(x)
  end
end

function coroutine_traverse(x)
  return coroutine.wrap(function()
    traverse(x)
  end)
end

local list = {1, {{2, 3}, {4, 5}}, {6, 7, 8}}

for value in coroutine_traverse(list) do
  print(value)
end

Python generators (stackless, asymmetric)

Python generators are a built-in feature since Python 2.x. They are single-frame coroutines.

A function that contains a yield keyword is considered a generator function. Calling a generator function will create a new generator object stopped at the beginning of the function, and can be resumed with the next(...) built-in function.

Being stackless, each coroutine has only one frame, so it cannot yield while calling another function. To implement recursive traversal with stackless coroutines, it is common to create one generator for each level of nested list, and yield values from the innermost coroutine to the outer coroutine, level by level.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
def traverse(x):
    if isinstance(x, list):
        for elem in x:
            new_gen = traverse(elem)    # Create the next level of generator
            try:
                while True:
                    v = next(new_gen)
                    yield v             # Yield everything the inner generator yields
            except StopIteration:       # until iteration stops.
                pass
    else:
        yield x

DATA = [1, [[2, 3], [4, 5]], [6, 7, 8]]

gen = traverse(DATA)    # The top-level generator
try:
    while True:
        v = next(gen)   # Iterate through it
        print(v)
except StopIteration:   # until iteration stops.
    pass

Python’s for statement is a syntax sugar for calling next(...) until the exception StopIteration is thrown. The yield from statement is a syntax sugar for yielding everything from another generator. Using all the syntax sugar, the code above will become the following:

1
2
3
4
5
6
7
8
9
10
11
def traverse(x):
    if isinstance(x, list):
        for elem in x:
            yield from traverse(elem)   # Use "yield from" to yield everything.
    else:
        yield x

DATA = [1, [[2, 3], [4, 5]], [6, 7, 8]]

for v in traverse(DATA):
    print(v)

Python coroutines (WTF?)

Python 3.5 attempts to introduce async/await-based asynchronous programming mechanisms, but it used the word “coroutine” to refer to functions annotated with the async keyword, like async def foo(...), which is confusing. Async functions may contain the new await expression, but its semantics is very vaguely defined as “suspend the execution of coroutine on an awaitable object”, whatever “on an awaitable object” means. That is in stark contrast to the highly detailed semantics of co_await expression in C++20 and the .await expression in Rust.

In the face of ambiguity, refuse the temptation to guess.

– Zen of Python

Because it is so confusing, I am not going to do the task using Python “coroutines”.

Python greenlet (stackful, symmetric)

Symmetric

The greenlet library provides stackful symmetric coroutines.

Greenlets are stackful. According to the documentation:

A “greenlet” is a small independent pseudo-thread. Think about it as a small stack of frames; the outermost (bottom) frame is the initial function you called, and the innermost frame is the one in which the greenlet is currently paused.

Greenlets are symmetric. One greenlet can switch to another using the glet.switch() method to pass a value, or glet.throw() to switch and immediately raise an exception.

There are implementations of Greenlets for both CPython and PyPy.

The official greenlet implementation for CPython uses platform-specific assembly code (for amd64, aarch64, riscv, etc.) to switch native stacks, similar to what [Boost Context] does.

PyPy implements the greenlet API using stacklets, which are PyPy’s own swap-stack mechanism. Like [Boost Context] and the official greenlet for CPython, PyPy also uses platform-specific assembly code (for x86-64, aarch64, mips64, etc. Sorry, RISC-V.) to switch between stacks.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import greenlet

def traverse(x, parent):
    if isinstance(x, list):
        for elem in x:
            traverse(elem, parent)      # recursive call
    else:
        parent.switch(x)                # can switch at any level of stack
        
def greenlet_traverse(x):
    current = greenlet.getcurrent()     # remember the current coroutine
    def _traverse_x():
        traverse(x, current)            # and pass it as the parent
        current.throw(StopIteration)
    return greenlet.greenlet(_traverse_x)

DATA = [1, [[2, 3], [4, 5]], [6, 7, 8]]

glet = greenlet_traverse(DATA)
try:
    while True:
        v = glet.switch()   # switch to the greenlet
        print(v)
except StopIteration:
    pass

Emulate asymmetric coroutine using the `parent` field

We have just demonstrated that greenlets are symmetric. However, each greenlet has a parent. It is the coroutine to switch to when the current coroutine terminates, normally or by exception. However, it doesn’t mean greenlets are asymmetric because the parent can be changed at any time during execution, and it is not wrong to explicitly switch to the parent.

We can rewrite our last example and use the glet.parent field instead of our own parent variable to record the parent.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import greenlet

def traverse(x):
    if isinstance(x, list):
        for elem in x:
            traverse(elem)
    else:
        greenlet.getcurrent().parent.switch(x)  # switch to parent
        
def greenlet_traverse(x):
    def _traverse_x():
        traverse(x)
        raise StopIteration()
    return greenlet.greenlet(_traverse_x)  # The parent is the current greenlet

DATA = [1, [[2, 3], [4, 5]], [6, 7, 8]]

glet = greenlet_traverse(DATA)
try:
    while True:
        v = glet.switch()   # switch to the greenlet
        print(v)
except StopIteration:
    pass

JavaScript generators (stackless, asymmetric)

The function* declaration defines a generator function. Generator functions can have yield operator that pauses the execution of the coroutine.

When a generator function called, it creates a generator object. It can be used like an iterator. The next method switches to the coroutine.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
function* traverse(x) {
    if (Array.isArray(x)) {
        for (const elem of x) {
            for (const y of traverse(elem)) {
                yield y;  // Yield what the inner layer yields.
            }
        }
    } else {
        yield x;
    }
}

const DATA = [1, [[2, 3], [4, 5]], [6, 7, 8]];

let gen = traverse(DATA);
for (;;) {
    const result = gen.next();  // Resumes the coroutine.
    if (result.done) {
        break;
    }
    console.log(result.value);
}

And there are syntax sugars. The yield* operator yields everything from another generator. Because a generator behaves like an iterator, the for-of statement can iterate through the values it yields.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
function* traverse(x) {
    if (Array.isArray(x)) {
        for (const elem of x) {
            yield* traverse(elem)
        }
    } else {
        yield x
    }
}

const DATA = [1, [[2, 3], [4, 5]], [6, 7, 8]]

for (const v of traverse(DATA)) {
    console.log(v);
}

JavaScript async/await (stackless, asymmetric, asynchronous)

JavaScript provides asynchronous programming facilities in the form of async/await (see appendix). An async function always returns a Promise object which can be settled (fulfilled or rejected) later. An async function may contain await operators which cause async function execution to pause until its operand (a Promise) is settled, and resume execution after fulfilment.

Asynchronous programming is more like cooperative multi-tasking than coroutines.

Despite the difference, I now give an example of traversing nested lists using async/await. I create two concurrent tasks, one traverses the nested list, and the other prints the numbers, and they communicate through a “zero-capacity queue”. It is similar to multi-thread programming, except there is only one thread executing both tasks in alternation, and the execution is scheduled by a scheduler.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
// A zero-capacity queue.
// `enqueue` will block until another task calls `dequeue`,
// and `dequeue` will block until another task calls `enqueue`.
class ZeroQueue {
    constructor() {
        this.getter_resolver = null
        this.data = null
        this.setter_resolver = null
    }

    async enqueue(num) {
        if (this.getter_resolver != null) {
            // If a consumer came before, we satisfy it.
            this.getter_resolver(num)
            this.getter_resolver = null
        } else {
            // If we come first, we leave the value and wait until consumed.
            this.data = num
            await new Promise((resolve) => {
                this.setter_resolver = resolve
            })
        }
    }

    async dequeue() {
        if (this.setter_resolver != null) {
            // If a producer already came, we take the value and let it continue.
            this.setter_resolver(null)
            this.setter_resolver = null
            return this.data
        } else {
            // If we come first, we wait for the producer.
            return await new Promise((resolve) => {
                this.getter_resolver = resolve
            })
        }
    }
}

const queue = new ZeroQueue()

async function traverse(x) {
    if (Array.isArray(x)) {
        for (const elem of x) {
            await traverse(elem)
        }
    } else {
        // await may potentially yield,
        // giving the user an impression of block-waiting.
        await queue.enqueue(x)
    }
}

async function print_all() {
    for (;;) {
        // await may potentially yield,
        // giving the user an impression of block-waiting.
        const v = await queue.dequeue()
        if (v == "end") {
            return
        } else {
            console.log(v)
        }
    }
}

const DATA = [1, [[2, 3], [4, 5]], [6, 7, 8]]

// The first task traverses the list and signal termination.
traverse(DATA).then(() => {
    queue.enqueue("end")
})

// The second task keep polling till the end.
print_all()

JavaScript async generators (stackless, asymmetric, asynchronous)

Functions annotated with async function* defines an async generator function. An async generator is like a generator, but the .next() method returns a Promise so it can be awaited. This allows the generator to use await while iterating. It can also use the for await ... of statement as a syntax sugar.

This practice is like building coroutine on top of async/await on top of coroutine, which looks ugly to me. Anyway, here is the code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
async function* traverse(x) {
    if (Array.isArray(x)) {
        for (const elem of x) {
            yield* await traverse(elem)
        }
    } else {
        yield x
    }
}

const DATA = [1, [[2, 3], [4, 5]], [6, 7, 8]]

async function main() {
    for await (const v of traverse(DATA)) {
        console.log(v);
    }
}

main()

Rust async/await (stackless, asymmetric, asynchronous)

Rust’s async and await keywords provides support for asynchronous programming (see appendix) based on stackless asymmetric coroutines. There is a dedicated book that covers asynchronous programming in Rust.

An async function or an async block, when executed, do not execute their bodies immediately, but creates an object that holds the execution context of that function or block. Each async function or block is represented to the user as a Future. The Future::poll method will resume the async thing until it yields (on an await site) or finishes.

The await expression can only be used in async functions or blocks. Its semantics is complicated but well-documented. It calls Future::poll on a Future object and, if the Future is ready, it grabs its value continues without yielding; otherwise, it yields from the current async function or block. When resumed, it will poll the Future again and may or may not yield depending on whether the Future is ready.

Implementation-wise, the documentation suggests that Rust decomposes an async function (or block) into a state machine (see appendix) where each state represents an await site.

Async/await is not supposed to be used like coroutines. In fact, the book Asynchronous Programming in Rust contrasts async/await against coroutines. I have given an example in JavaScript that traverses nested list using async/await using two tasks and a channel. It is possible to do the same in Rust, but that’ll need a scheduler. Since I am too lazy to write a scheduler or introduce a third-party scheduler, I’ll try a different approach here.

I’ll abuse the async/await mechanism to exploit its underlying coroutine and make it behave like a generator.

Resume: We know that Future::poll resumes the coroutine. We call Future::poll directly, which is seldom done in usual async/await-based programs unless we are implementing the “executor” (i.e. scheduler).
Yield: Each .await corresponds to a yield site. We customise the behaviour of our Future object (i.e. WaitUntilResultTaken) so that the .await always yields (Pending) when reached from within the coroutine, but will continue (Ready) when resumed from the main function. The behaviour is controlled by the result_taken variable.

The async fn traverse will recursively call itself in the .await expression, and will yield level by level through all the .await sites to the main function.

Then we have a problem. Whenever it yields, the value yielded to the .poll() call site in main is always Poll::Pending. Then how do we pass the traversed numbers to main? To work around this, the coroutine writes the number in a shared variable result_reporter.num before it yields. This makes the ResultReporter struct effectively behave like a “zero-capacity queue” in our previous example, but it connects the generator and the main function instead of two tasks.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
use async_recursion::async_recursion;
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, AtomicI32, Ordering};
use std::task::{Context, Poll};

#[derive(Default)]
struct ResultReporter {
    num: AtomicI32, // Rust is having problem figuring out whether the coroutine races with main,
    result_taken: AtomicBool, // so we just use atomic variables here.
}

impl ResultReporter {
    fn report_result<'a>(&'a self, num: i32) -> WaitUntilResultTaken<'a> {
        self.num.store(num, Ordering::Relaxed);
        self.result_taken.store(false, Ordering::Relaxed);
        WaitUntilResultTaken {
            result_taken: &self.result_taken,
        }
    }
    fn take_result(&self) -> i32 {
        let num = self.num.load(Ordering::Relaxed);
        self.result_taken.store(true, Ordering::Relaxed);
        num
    }
}

struct WaitUntilResultTaken<'a> {
    result_taken: &'a AtomicBool,
}

impl<'a> Future for WaitUntilResultTaken<'a> {
    type Output = ();

    fn poll(self: Pin<&mut Self>, _: &mut Context<'_>) -> Poll<Self::Output> {
        if self.result_taken.load(Ordering::Relaxed) {
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    }
}

enum NestedList {
    Leaf(i32),
    Nested(Vec<NestedList>),
}

#[async_recursion]
async fn traverse(list: &NestedList, reporter: &ResultReporter) {
    match list {
        NestedList::Leaf(num) => {
            // This `await` expression will call `WaitUntilResultTaken::poll()` twice.
            // The first time is when we reach the `await` here.  It returns `Poll::Pending` so we yield.
            // The second time is when `main` calls `poll`.  It returns `Poll::Ready(())` so we continue.
            reporter.report_result(*num).await
        }
        NestedList::Nested(lists) => {
            for elem in lists {
                // This `await` will pass the `Poll::Pending` to the caller level by level until it reaches `main`.
                traverse(elem, reporter).await
            }
        }
    }
}

fn main() {
    // [1, [[2, 3], [4, 5]], [6, 7, 8]]
    let nested_list = NestedList::Nested(vec![
        NestedList::Leaf(1),
        NestedList::Nested(vec![
            NestedList::Nested(vec![NestedList::Leaf(2), NestedList::Leaf(3)]),
            NestedList::Nested(vec![NestedList::Leaf(4), NestedList::Leaf(5)]),
        ]),
        NestedList::Nested(vec![
            NestedList::Leaf(6),
            NestedList::Leaf(7),
            NestedList::Leaf(8),
        ]),
    ]);

    let result_reporter = ResultReporter::default();

    let mut coro = traverse(&nested_list, &result_reporter);

    let null_ctx = unsafe { &mut *std::ptr::null_mut::<Context>() };

    loop {
        let coro_p = unsafe { Pin::new_unchecked(&mut coro) };
        let poll_result = coro_p.poll(null_ctx); // Let the coroutine run.
        match poll_result {
            Poll::Pending => {
                // When pausing (at `await` sites) during execution, we get the result.
                let num = result_reporter.take_result();
                println!("{}", num);
            }
            Poll::Ready(()) => {
                // When finished, we just quit.
                break;
            }
        }
    }
}

Rust coroutine crate (stackful, asymmetric)

The third-part crate coroutine provides stackful asymmetric coroutines. It is built upon the context crate (see below).

It looks like this crate has not been maintained for quite some time and it won’t compile. I’ll not do the task using the coroutine crate. If you are interested, their documentation contains some examples.

Rust context crate (stackful, symmetric)

The third-part crate context is similar to [Boost Context]. It provides the abstraction of stackful symmetric coroutines.

It implements swap-stack using machine-specific assembly code (x86-64, AArch64, ppc64, sorry RISC-V).

The Context::resume method switches the thread to the other coroutine, and pass the context of the original coroutine so the thread can switch back.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
use context::stack::ProtectedFixedSizeStack;
use context::{Context, Transfer};

#[derive(Debug)]
enum NestedList {
    Leaf(i32),
    Nested(Vec<NestedList>),
}

fn traverse(list: &NestedList, t: &mut Option<Transfer>) {
    match list {
        NestedList::Leaf(num) => {
            // Context-switching consumes the `Context` object.  We have to take and replace it.
            let old_t = t.take().unwrap();
            let new_t = unsafe { old_t.context.resume(*num as usize) };
            *t = Some(new_t);
        }
        NestedList::Nested(lists) => {
            for elem in lists {
                // This is stackful coroutine.  We can do recursive function call.
                traverse(elem, t)
            }
        }
    }
}

fn main() {
    // [1, [[2, 3], [4, 5]], [6, 7, 8]]
    let nested_list = NestedList::Nested(vec![
        NestedList::Leaf(1),
        NestedList::Nested(vec![
            NestedList::Nested(vec![NestedList::Leaf(2), NestedList::Leaf(3)]),
            NestedList::Nested(vec![NestedList::Leaf(4), NestedList::Leaf(5)]),
        ]),
        NestedList::Nested(vec![
            NestedList::Leaf(6),
            NestedList::Leaf(7),
            NestedList::Leaf(8),
        ]),
    ]);

    extern "C" fn context_function(t: Transfer) -> ! {
        // The initial Transfer carries the pointer to the list as `data`.
        let list = unsafe { &*(t.data as *const NestedList) };

        // From now on, we'll frequently take and replace the Transfer. Use Option.
        let mut t_holder = Some(t);
        traverse(list, &mut t_holder);

        // Send a special value to indicate the end of traversal.
        let t = t_holder.unwrap();
        unsafe { t.context.resume(usize::MAX) };
        unreachable!();
    }

    let stack = ProtectedFixedSizeStack::default();

    let mut t = Transfer::new(unsafe { Context::new(&stack, context_function) }, 0);

    // The initial `resume` sends the list reference as a usize.
    t = unsafe { t.context.resume(&nested_list as *const NestedList as usize) };

    // Use this special value to indicate end of traversal.
    while t.data != usize::MAX {
        println!("{}", t.data);

        // Subsequent `resume` doesn't need to carry values.
        t = unsafe { t.context.resume(0usize) };
    }
}

C# iterators (stackless, asymmetric)

C# can implement iterators using the yield return or yield break statements. A function that contains yield returns an Enumerable<T> or Enumerator<T> which is resumed every time an item is requested.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
using System;
using System.Collections.Generic;

interface NestedList
{
    IEnumerable<int> Traverse();
}

class Leaf : NestedList
{
    private int num;

    public Leaf(int num)
    {
        this.num = num;
    }

    public IEnumerable<int> Traverse()
    {
        yield return num;
    }
}

class Branch : NestedList
{
    private List<NestedList> children;

    public Branch()
    {
        children = new List<NestedList>();
    }

    public Branch Add(NestedList child)
    {
        children.Add(child);
        return this;
    }

    public IEnumerable<int> Traverse()
    {
        foreach (var child in children)
        {
            // It is stackless.  We need to yield from the inner iterators.
            foreach (var y in child.Traverse())
            {
                yield return y;
            }
        }
    }
}

public class IteratorTraversal
{
    public static void Main()
    {
        // [1, [[2, 3], [4, 5]], [6, 7, 8]]
        var nestedList = new Branch()
            .Add(new Leaf(1))
            .Add(new Branch()
                    .Add(new Branch()
                        .Add(new Leaf(2))
                        .Add(new Leaf(3)))
                    .Add(new Branch()
                        .Add(new Leaf(4))
                        .Add(new Leaf(5))))
            .Add(new Branch()
                    .Add(new Leaf(6))
                    .Add(new Leaf(7))
                    .Add(new Leaf(8)));

        foreach (var n in nestedList.Traverse()) // Create the iterator
        {
            Console.WriteLine(n);
        }
    }
}

C# async/await (stackless, asymmetric, asynchronous)

C# supports asynchronous programming (see appendix). Like async/await in any other language, its programming model is more like cooperative multi-task programming than coroutines. It is possible to implement asynchronous traversal using an AsyncQueue to communicate between the traversal task and a consumer task that consumes the visited values. I am not going to give an example here.

C swapcontext (stackful, symmetric)

The C programming language itself doesn’t have any support for coroutines or swap-stack.

The POSIX function makecontext can create a “context” for a function on a given stack so that when a subsequent invocation of swapcontext swaps to that context, it will continue execution from the beginning of the given function on the given stack. This effectively provides a swap-stack mechanism, and can be used to implement symmetric coroutines.

One design flaw of makecontext is that it only supports passing int arguments to the given function. Because the size of int is platform-specific, it is hard to pass pointers across makecontext. According to the man page, POSIX 2008 removed makecontext and swapcontext, citing portability issues, and recommended the use of POSIX threads.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

struct Node {
    int value;
    struct Node *first_child;
    struct Node *next_sibling;
};

// [1, [[2, 3], [4, 5]], [6, 7, 8]]
struct Node node8        = { .value = 8,  .first_child = NULL,    .next_sibling = NULL      };
struct Node node7        = { .value = 7,  .first_child = NULL,    .next_sibling = &node8    };
struct Node node6        = { .value = 6,  .first_child = NULL,    .next_sibling = &node7    };
struct Node node678      = { .value = -1, .first_child = &node6,  .next_sibling = NULL      };
struct Node node5        = { .value = 5,  .first_child = NULL,    .next_sibling = NULL      };
struct Node node4        = { .value = 4,  .first_child = NULL,    .next_sibling = &node5    };
struct Node node45       = { .value = -1, .first_child = &node4,  .next_sibling = NULL      };
struct Node node3        = { .value = 3,  .first_child = NULL,    .next_sibling = NULL      };
struct Node node2        = { .value = 2,  .first_child = NULL,    .next_sibling = &node3    };
struct Node node23       = { .value = -1, .first_child = &node2,  .next_sibling = &node45   };
struct Node node2345     = { .value = -1, .first_child = &node23, .next_sibling = &node678  };
struct Node node1        = { .value = 1,  .first_child = NULL,    .next_sibling = &node2345 };
struct Node node12345678 = { .value = -1, .first_child = &node1,  .next_sibling = NULL      };

ucontext_t main_context;
ucontext_t coro_context;
int current_value;
bool finished;

void do_yield(int value) {
    current_value = value;
    swapcontext(&coro_context, &main_context);  // switch back to main
}

void visit_node(struct Node *node) {
    for (struct Node *current = node; current != NULL; current = current->next_sibling) {
        if (current->first_child == NULL) {
            do_yield(current->value);
        } else {
            visit_node(current->first_child);   // recursive call
        }
    }
}

void traverse() {
    finished = false;
    visit_node(&node12345678);
    finished = true;
}

int main() {
    getcontext(&coro_context);

    // Obviously, it is stackful.
    void *stack = malloc(65536);    // Not sure if 65536 is enough, though.
                                    // If we are careful enough, we should use mprotect
                                    // to create a PROT_NONE region to protect against
                                    // stack overflow.
    coro_context.uc_stack = (stack_t){ .ss_sp = stack, .ss_size = 4096 };
    coro_context.uc_link = &main_context;
    makecontext(&coro_context, (void(*)())traverse, 0);

    while (true) {
        swapcontext(&main_context, &coro_context);  // switch to coroutine
        if (finished) {
            break;
        } else {
            printf("%d\n", current_value);
        }
    }

    return 0;
}

C++ coroutine (stackless, asymmetric, asynchronous)

C++20 introduced “coroutines”. More precisely, it introduced mechanisms so that libraries can implement asynchronous programming on top of them.

Any function that contains co_await, co_yield and/or co_return are coroutine functions.

Calling a coroutine function behaves like the pseudo-code defined in Section 9.5.4 (dcl.fct.def.coroutine) paragraph 5.

{
    promise-type promise promise-constructor-arguments ;
    try {
        co_await promise.initial_suspend() ;
        function-body
    } catch ( ... ) {
        if (!initial-await-resume-called)
            throw ;
        promise.unhandled_exception() ;
    }
final-suspend :
    co_await promise.final_suspend() ;
}

Basically, it implicitly creates a “promise object” which is called (awaited) at different points of the coroutine execution, like “hooks” or aspect-oriented programming. The compiler will generate those promise.xxxx() method calls, and it is the programmer’s (or library writer’s) responsibility to define the “promise type” and the .initial_suspend(), .unhandled_exception(), and .final_suspend() methods to make the generated calls meaningful.

And evaluation a co_await expression behaves as defined in Section 7.6.2.3 (expr.await) paragraph 5. It takes an “awaiter” (more precisely, anything that can be converted to an “awaiter”) as operand, and

It calls awaiter.await_ready().
- If it returns true, it calls awaiter.await_resume(), and that’s the value of the co_await expression.
- If it returns false, it calls awaiter.await_suspend(), and depending on its result, it may suspend the execution of the coroutine, or suspend and switch to another coroutine, or just continue execution.

It’s the “conditionally yielding” behaviour of common async/await-based programming. Again the compiler generates the above method calls, and it is again the programmer’s (or library writer’s) responsibility to implement the .await_ready(), .await_resume() and .await_suspend() methods to make those generated calls meaningful.

The co_yield expression calls promise.yield_value(x), and programmer (or library writer) shall implement the promise object to make .yield_value method meaningful.

And the co_return statement calls promise.return_void(), and the programmer (or library writer) shall implement the promise object to make .return_void method meaningful.

And the standard library function coroutine_handle::resume() resumes a paused “coroutine”.

As we can see

the specification defines the semantics of coroutine functions, the co_await, co_yield and co_return expressions/statements, and the coroutine_handle library object and its .resume() method, and
the compiler (GCC, Clang, etc.) generate code for the co_xxx expression/statements, and
the programmer defines the promise type and (optionally) “awaiter” types and fills in lots and lots of methods to customise the behaviour.

How complicated C++20 “coroutines” are!

It is complicated enough to confuse a C++ programmer with 25-years’ experience.

And I admit I am not smart enough to use C++20 coroutines.

If you are not smart enough, either, but want to learn about C++20 coroutines, I recommend starting with another language, such as Ruby or Lua, and come back to C++20 when the idea of coroutines don’t scare you.

Anyway, here is the code. The following code tries to use C++20 coroutines as generators.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
#include <iostream>
#include <coroutine>

struct Node {
    int value;
    Node *first_child;
    Node *next_sibling;
};

// [1, [[2, 3], [4, 5]], [6, 7, 8]]
Node node8        = { .value = 8,  .first_child = nullptr, .next_sibling = nullptr   };
Node node7        = { .value = 7,  .first_child = nullptr, .next_sibling = &node8    };
Node node6        = { .value = 6,  .first_child = nullptr, .next_sibling = &node7    };
Node node678      = { .value = -1, .first_child = &node6,  .next_sibling = nullptr   };
Node node5        = { .value = 5,  .first_child = nullptr, .next_sibling = nullptr   };
Node node4        = { .value = 4,  .first_child = nullptr, .next_sibling = &node5    };
Node node45       = { .value = -1, .first_child = &node4,  .next_sibling = nullptr   };
Node node3        = { .value = 3,  .first_child = nullptr, .next_sibling = nullptr   };
Node node2        = { .value = 2,  .first_child = nullptr, .next_sibling = &node3    };
Node node23       = { .value = -1, .first_child = &node2,  .next_sibling = &node45   };
Node node2345     = { .value = -1, .first_child = &node23, .next_sibling = &node678  };
Node node1        = { .value = 1,  .first_child = nullptr, .next_sibling = &node2345 };
Node node12345678 = { .value = -1, .first_child = &node1,  .next_sibling = nullptr   };

struct Traverser {
    struct promise_type;
    using handle_type = std::coroutine_handle<promise_type>;

    struct promise_type {
        int current_value_ = -1;
        bool finished_ = false;

        // This is executed when the "coroutine" is created.
        Traverser get_return_object() {
            return Traverser(handle_type::from_promise(*this));
        }

        // Called at the beginning of the coroutine.
        // We let it stop there to mimic Python generator behaviour.
        std::suspend_always initial_suspend() {
            return {};
        }

        // Called when the coroutine finished execution.
        std::suspend_always final_suspend() noexcept {
            // We set a variable so the main function knows it finished.
            finished_ = true;
            return {};
        }

        // Called when a co_yield expression is evaluated.
        std::suspend_always yield_value(int value) {
            current_value_ = value;
            return {};
        }

        // Called when an exception is thrown.
        void unhandled_exception() {
            std::terminate();
        }
    };

    handle_type handle_;

    Traverser(handle_type handle) : handle_(handle) {}

    void resume() {
        handle_.resume();
    }

    bool finished() const {
        return handle_.promise().finished_;
    }

    int get_value() {
        return handle_.promise().current_value_;
    }
};

Traverser visit_node(Node *node) {
    for (Node *current = node; current != nullptr; current = current->next_sibling) {
        if (current->first_child == nullptr) {
            // Yield.
            co_yield current->value;
        } else {
            // It is stackless.  We need multiple levels of coroutines.
            auto sub_traverser = visit_node(current->first_child);
            while (true) {
                sub_traverser.resume();
                if (sub_traverser.finished()) {
                    break;
                }
                // Yield from sub-coroutine.
                co_yield sub_traverser.get_value();
            }
        }
    }
}

int main() {
    auto traverser = visit_node(&node12345678);
    while (true) {
        traverser.resume();
        if (traverser.finished()) {
            break;
        }
        std::cout << traverser.get_value() << std::endl;
    }

    return 0;
}

C++ Boost.Coroutine2 (stackful, asymmetric)

Note: Boost.Coroutine is deprecated in favour for Boost.Coroutine2.

Boost.Coroutine2 provides stackful asymmetric coroutines. It is implemented on top of Boost.Context (see below).

A boost::coroutines2::coroutine<T> has two members: pull_type and push_type. A coroutine can be created by instantiating either of them. In this task, we will create a pull_type so that the main function can pull data from the coroutine. It will create a coroutine which receives a &push_type so that it can yield and push data back to the main function.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
#include <iostream>
#include <boost/coroutine2/coroutine.hpp>

struct Node {
    int value;
    Node *first_child;
    Node *next_sibling;
};

// [1, [[2, 3], [4, 5]], [6, 7, 8]]
Node node8        = { .value = 8,  .first_child = nullptr, .next_sibling = nullptr   };
Node node7        = { .value = 7,  .first_child = nullptr, .next_sibling = &node8    };
Node node6        = { .value = 6,  .first_child = nullptr, .next_sibling = &node7    };
Node node678      = { .value = -1, .first_child = &node6,  .next_sibling = nullptr   };
Node node5        = { .value = 5,  .first_child = nullptr, .next_sibling = nullptr   };
Node node4        = { .value = 4,  .first_child = nullptr, .next_sibling = &node5    };
Node node45       = { .value = -1, .first_child = &node4,  .next_sibling = nullptr   };
Node node3        = { .value = 3,  .first_child = nullptr, .next_sibling = nullptr   };
Node node2        = { .value = 2,  .first_child = nullptr, .next_sibling = &node3    };
Node node23       = { .value = -1, .first_child = &node2,  .next_sibling = &node45   };
Node node2345     = { .value = -1, .first_child = &node23, .next_sibling = &node678  };
Node node1        = { .value = 1,  .first_child = nullptr, .next_sibling = &node2345 };
Node node12345678 = { .value = -1, .first_child = &node1,  .next_sibling = nullptr   };

typedef boost::coroutines2::coroutine<int> coro_t;

void visit_node(Node *node, coro_t::push_type &sink) {
    for (Node *current = node; current != nullptr; current = current->next_sibling) {
        if (current->first_child == nullptr) {
            sink(current->value);   // Yield at any level.  It's stackful!
        } else {
            visit_node(current->first_child, sink);   // Recursive call.
        }
    }
}

int main() {
    auto traverser = coro_t::pull_type([](coro_t::push_type &sink) {
        visit_node(&node12345678, sink);
    });

    for (auto value : traverser) { // Can be used as iterator using begin() and end().
        std::cout << value << std::endl;
    }

    return 0;
}

C++ Boost.Context (stackful, symmetric)

Boost.Context implements a swap-stack mechanism using machine-dependent assembly language(x86-64, ARM64, RISCV64, etc.). We can also implement symmetric coroutines using its C++ API. A boost::coroutine::fiber represents a coroutine, and fiber::resume() switches to that coroutine. Because fiber::resume() doesn’t pass values, we need to use a side channel (the shared State object) to pass the value and indicate that the traversal has finished.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
#include <iostream>
#include <boost/coroutine2/coroutine.hpp>

struct Node {
    int value;
    Node *first_child;
    Node *next_sibling;
};

// [1, [[2, 3], [4, 5]], [6, 7, 8]]
Node node8        = { .value = 8,  .first_child = nullptr, .next_sibling = nullptr   };
Node node7        = { .value = 7,  .first_child = nullptr, .next_sibling = &node8    };
Node node6        = { .value = 6,  .first_child = nullptr, .next_sibling = &node7    };
Node node678      = { .value = -1, .first_child = &node6,  .next_sibling = nullptr   };
Node node5        = { .value = 5,  .first_child = nullptr, .next_sibling = nullptr   };
Node node4        = { .value = 4,  .first_child = nullptr, .next_sibling = &node5    };
Node node45       = { .value = -1, .first_child = &node4,  .next_sibling = nullptr   };
Node node3        = { .value = 3,  .first_child = nullptr, .next_sibling = nullptr   };
Node node2        = { .value = 2,  .first_child = nullptr, .next_sibling = &node3    };
Node node23       = { .value = -1, .first_child = &node2,  .next_sibling = &node45   };
Node node2345     = { .value = -1, .first_child = &node23, .next_sibling = &node678  };
Node node1        = { .value = 1,  .first_child = nullptr, .next_sibling = &node2345 };
Node node12345678 = { .value = -1, .first_child = &node1,  .next_sibling = nullptr   };

struct State {
    int current_value = -1;
    bool finished = false;
};

void visit_node(Node *node, State &state, boost::context::fiber &sink) {
    for (Node *current = node; current != nullptr; current = current->next_sibling) {
        if (current->first_child == nullptr) {
            state.current_value = current->value;
            sink = std::move(sink).resume();   // Yield at any level.  It's stackful!
        } else {
            visit_node(current->first_child, state, sink);   // Recursive call.
        }
    }
}

int main() {
    State state;

    auto traverser = boost::context::fiber([&state](boost::context::fiber &&sink) {
        visit_node(&node12345678, state, sink);
        state.finished = true;
        return std::move(sink);
    });

    while (true) {
        traverser = std::move(traverser).resume();  // Resume the coroutine.
        if (state.finished) {
            break;
        }
        std::cout << state.current_value << std::endl;
    }

    return 0;
}

Appendices

Why this task?

The purpose of this task is to compare symmetric, asymmetric, stackful and stackless coroutines.

This task is natural to implement with coroutines, because the traversal of a data structure is relatively independent from the consumption of the values.

This task is also much easier with stackful coroutines than stackless coroutines. Because the data structure (nested lists) is recursive, it is easier to traverse it using a recursive algorithm, and recursion needs a stack. Stackful coroutines can handle recursive calls quite trivially. But when using stackless coroutines, we have to do some hack and chain up multiple coroutines to form a stack of coroutines, and yield values from the innermost coroutine through multiple layers of coroutines to the consumer.

From the code examples above, we can clearly see the difference between different kinds of coroutines.

What are coroutines anyway?

According to Conway, a coroutine is “an autonomous program which communicates with adjacent modules as if they were input or output subroutines”. Coroutines are subroutines all at the same level, each acting as if it were the master program when in fact there is no master program.

Coroutines has the following characteristics. (See Coroutines in Lua)

The values of data local to a coroutine persist between successive calls.
The execution of a coroutine is suspended as control leaves it, only to carry on where it left off when control re-enters the coroutine at some later stage.

The first characteristic means coroutines can be resumed from where it paused.

In the second characteristic, “as control leaves it” means it is the programmer that decides where to pause a coroutine, not the implicit scheduler. Control flow is part of a program, not the runtime.

For this reason, the “fibers” in Ruby, the “generators” in Python, the “threads” in Lua, and the “user contexts” in the swapcontext POSIX API are all coroutines, despite not being called “coroutines”. The programmer explicitly transfers control from one to another.

On the other hand, a “goroutine” in the Go programming language is not a coroutine, despite the similar name. In fact, the official document defines a “goroutine” as “an independent concurrent thread of control”, i.e. a thread.

Note: Goroutine is a threading mechanism implemented in the user space, an “M × N” green thread system. The Go runtime schedules M goroutines on N native threads. It is the scheduler that switches a native thread between different goroutines at unspecified time and locations, usually when IO operations may block, or when queues are full or empty. To the programmer, a goroutine is just like a thread: it keeps going forward. It may block, but not because the programmer asked it to yield, but because some requests cannot be satisfied, such as queues and IO.

Coroutines are not mutually exclusive with threads. Each thread is executing one coroutine at a time, and each thread may jump from one coroutine to another according to the control flow in the program.

Symmetric and asymmetric coroutines.

With symmetric coroutines, all coroutines are equal. A thread can jump from any coroutine to any other coroutine. When switching, the programmer always needs to specify which coroutine to jump to, i.e. destination.

With asymmetric coroutines, coroutines have a parent-child relation. There are two different operations that jumps between coroutines, namely resume and yield. When a thread “resumes” a coroutine, the destination becomes the child of the source coroutine, until it “yields”, when the thread jumps back from the child to the parent coroutine. A coroutine cannot be resumed when it already has a parent. When yielding, the programmer doesn’t need to specify the destination, because the destination is always implicitly the parent coroutine.

Symmetric and asymmetric coroutines have equal expressive power, and they can implement each other. See Coroutines in Lua for more details.

Stackful and stackless coroutines.

A stackful coroutine has its own execution stack. A thread can switch coroutine when the current coroutine has any number of frames on its stack. When switching, the thread saves the entire stack and switches to a whole new stack (implementation-wise, it only needs to save registers, change the stack pointer, and restore registers from the new stack).

A stackless coroutine has only one frame. Therefore, a coroutine is usually defined by a “coroutine function”. From my knowledge, all stackless coroutines are asymmetric. A coroutine function can only yield within the coroutine function itself. That means when a coroutine C1 calls another function F2, the thread cannot yield from C1 while executing F2; when a coroutine C1 resumes another coroutine C2, the thread can only yield from C2 back to C1, but not directly from C2 to C1’s parent.

Stackful coroutines are more powerful, but need some kind of “swap-stack” mechanism (see below) to implement. Stackless coroutines are more restricted, but does not require swap-stack.

The swap-stack mechanism

Conceptually, the context of nested function calls is a stack, a last-in-first-out data structure, called the control stack, often simply called a stack.

A control stack has many frames. Each frame contains the execution context of a function activation, (and this is why a frame is also known as an activation record). The context includes the program counter (PC) as well as the values of local variables. The top frame is the context of the most recently entered function, and is the only frame on a stack that is active. All other frames are paused at a call site, waiting for the called function (the frame above it) to return.

In most programming languages, a thread is always bound to one stack. But more generally, a thread can switch among different stacks. When a thread switches from one stack to another, it pauses the top frame as well making the whole stack paused. Then it switches its stack pointer to the new stack, and resume the top frame of the new stack, therefore continue from where that frame was paused.

This is basically what stackful coroutine does. Each coroutine has a stack, which can be paused and resumed. When resumed, it continues from where it was paused.

Implementing swap-stack with compiler

Implementation-wise, if the language is compiled, it needs a special instruction sequence that does the following things:

Save live registers on the top of the current stack, and
set the stack pointer (SP) register to the destination stack, and
restore the saved registers from the top of the destination stack.

The C and C++ programming languages themselves do not have support for swap-stack. In practice, we usually rely on libraries or compiler extensions to do that in C or C++.

Here I give two examples of implementations of swap-stack for compiled code.

One is from Boost.Context. It is implemented as a library in the assembly language, therefore it has to be platform-specific. Here are the code for x86-64, ARM64 and RISCV64. Because it is implemented as a library, it can only depend on the application binary interface (ABI) of the platform. In this case, the assembly code has to conservatively save all callee-saved registers no matter whether they are still in use or not, because as a library, it does not have the liveness information the compiler has.
The other is an LLVM extension created by Dolan et al.. As part of a compiler framework, it can identify and save only live registers, making it much more efficient than library-based approaches.

Implementing swap-stack with interpreter

If the language is interpreted, it depends. An interpreter can be stackful or stackless, and even stackless interpreters can allow the interpreted functions to call foreign C functions.

A stackful interpreter uses the native (C) stack to implement function invocation. Such interpreters usually has the following form:

void interpret_function(Frame *frame) {
    while (true) {
        switch (current_instruction().type()) {
        case ...:
            ...
        case CALL:
            ...
            Frame *new_frame = create_frame(called_function);
            interpret_function(new_frame);  // Recursive call
            ...
        }
    }
}

Because a language-level function call corresponds to an interpreter-level (C-level) function call, each frame in the interpreted language corresponds to a C frame. It is difficult to implement swap-stack with stackful interpreter because it needs to swap out the C stack (which has C frames) in order to swap out the interpreted language stack. As we have discussed before, C does not have native support for swap-stack, and it needs libraries written in assembly language or compiler extensions to do so.

What about stackless interpreters?

A stackless interpreter does not turn language-level function calls into C-level function calls. A stackless interpreter usually has the following form:

void interpret_function(Frame *initial_frame) {
    Frame *current_frame = initial_frame;
    while (true) {
        switch (current_instruction().type()) {
        case ...:
            ...
        case CALL:
            ...
            Frame *new_frame = create_frame(called_function);
            new_frame->parent = current_frame;
            current_frame = new_frame;  // Only replace the frame pointer
            ...
        }
    }
}

A stackless interpreter always remains in the single interpret_function function activation even when the interpreted language program makes a call. Swap-stack is relatively easier to implement with stackless interpreter, because it does not need to swap out any C frames…

… unless it allows foreign function calls. If the stackless interpreter allows the interpreted language to call foreign C functions, then C functions must have frames on some stack. Then we face the same problem as implementing swap-stack for compiled languages.

The Mu micro virtual machine

I designed the Mu micro virtual machine, and it is the main part of my PhD thesis. Swap-stack is a very important mechanism of the Mu micro VM, and it is designed to be supported by the JIT compiler. It enables the implementation of symmetric stackful coroutines, and it is the foundation of other VM mechanisms, such as trapping and on-stack replacement (OSR). If you are interested, read Section 5.3.6 of my thesis.

Decomposing a function into a state machine

Interpreters usually have no problem saving the frame of a function at a yield point so that it can be resumed later. The interpreter can implement the layout of stack frames and the behaviour of function calls / coroutine resumption in any way it wants. They may even allocate frames in the heap so that they can temporarily remove a frame from the stack and put it back later. For compilers, if swap-stack is available, one thread can just save the register states on one stack and restore them from another stack. Without swap-stack, however, it may be a challenge.

One way to implement pause-able and resume-able functions is decomposing a function into a state machine. Each yield point becomes a state, and a function starts by matching the state and jumping to the right place.

For example, assume the yield() call represents a coroutine yield in the following C pseudo-code:

void foo() {
    printf("Long\n");
    yield();
    printf("time\n");
    yield();
    printf("no\n");
    yield();
    printf("see\n");
    return;
}

Such a function can be transformed into a function that takes a state when called, and returns a new state when yielding or returning.

enum State {
    START, STATE1, STATE2, STATE3, END
};

enum State foo(enum State state) {
    switch(state) {
    case START:
        printf("Long\n");
        return STATE1;

    case STATE1:
        printf("time\n");
        return STATE2;

    case STATE2:
        printf("no\n");
        return STATE3;

    case STATE3:
        printf("see\n");
        return END;
    }
}

int main() {
    enum State state = START;
    while (state != END) {
        printf("Resuming foo()...\n");
        state = foo(state);
        printf("foo() paused\n");
    }
}

What about local variables? Local variables can be packed into the states, too. C programmers may use the union type, but it is easier with tagged unions or object-oriented programming.

Suppose we have a (pseudo) Rust function where yield!() represents a coroutine yield.

fn square_sum(a: i32, b: i32) -> Future<i32> {
    let a2 = a * a;
    println!("a * a = {}", a2);
    yield!();

    let b2 = b * b;
    println!("b * b = {}", b2);
    yield!();

    let result = a2 + b2;
    println!("result = {}", result);
    return result;
}

We can use an enum to hold live (will be used later) local variables at each yield!() point.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
/// Note: each state holds the live local variables.
enum State {
    Start { a: i32, b: i32 },
    State1 { b: i32, a2: i32 },  // Note: a is no longer useful.
    State2 { a2: i32, b2: i32 }, // Note: neither a nor b are useful now.
    End { result: i32 },
}

/// Calling this function merely gets the initial state.
/// It doesn't actually execute the body of the original function.
fn square_sum(a: i32, b: i32) -> State {
    State::Start { a, b }
}

/// Call this for each step.
fn square_sum_step(state: State) -> State {
    match state {
        State::Start { a, b } => { // Restore local variables from the state.
            let a2 = a * a;
            println!("a * a = {}", a2);
            State::State1 { b, a2 } // Save useful local variables into state.
        }

        State::State1 { b, a2 } => { // restore
            let b2 = b * b;
            println!("b * b = {}", b2);
            State::State2 { a2, b2 } // save
        }

        State::State2 { a2, b2 } => { // restore
            let result = a2 + b2;
            println!("result = {}", result);
            State::End { result } // save
        }

        State::End { .. } => {
            panic!("Coroutine already finished!");
        }
    }
}

fn main() {
    let mut state = square_sum(3, 4);
    loop {
        match state {
            State::End { result } => {
                println!("Execution finished. Result is {}", result);
                break;
            }
            _ => {
                println!("Resuming...");
                state = square_sum_step(state);
                println!("Yielded.");
            }
        }
    }
}

Asynchronous programming (async/await) and coroutines

In asynchronous programming, a program consists of many tasks that can be completed in the future, and one task can wait for other tasks to complete before continuing. There are many ways to implement asynchronous programming. It can be trivially executed inline (e.g. the X10 compiler is allowed to inline an async activity), executed sequentially, using threads, or using coroutines.

The notion of “Future” and its friend “Promise” are well-known in multi-thread programming (C++, Java, C# and Python). A pair of Future and Promise represents a value yet to be produced. The Future waits for the value to be produced, and the Promise is a place to store the value to be acquired via the Future. C++ even has the std::async function in the standard library to launch an asynchronous task in a new thread.

Recently many programming languages employed a style of asynchronous programming language based on coroutines in the form of async/await. I guess the reason behind its gaining popularity is two fold:

Native, OS-provided threads are too heavy-weight, but not many programming languages support light-weight “M × N” green threads, that is, M OS threads are multiplexed to run N application-level threads and N ≫ M. AFAIK, only Erlang and Go supports such light-weight threads.
Not many languages support stackful coroutines. As we discussed before, stackful coroutines are only practical with swap-stack. Some languages (such as Kotlin) are targeted to runtimes (such as JVM) that don’t support swap-stack.

As a compromise, some languages resorted to coroutines. They attempted to implement cooperative multi-tasking using coroutines that yield when they are about to block, and a scheduler that decides which coroutine can continue without blocking. And there is async/await.

A function can be annotated with the async keyword. An async function is like a Python generator. When called, it doesn’t execute the body of the function immediately, but will create an object that holds the execution context of the function, like a frame. An async function may contain await expressions. An await is like a conditional yield. If a given Future is ready, then grab the value and continue; otherwise, suspend the execution and give control to the scheduler so that it can find something else to execute.

Async and await gives the programmer the feeling of multi-thread programming except that the programmer must explicitly annotate places that may potentially yield with await.

You can find an async/await example in JavaScript earlier in this post. It looks pretty like two threads communicating with each other using a channel.

Consequence of being stackless

Without proper swap-stack support, the compiler has to implement coroutines by decomposing async functions into state machines. await expressions are places the function may yield, and each await represents a state in the state machine.

However, async/await is not the only way to implement cooperative multi-task programming on top of coroutines. gevent is a Python framework based on Greenlets which implement symmetric coroutines. With the ability to switch coroutine at any level of stack, each coroutine can yield to the scheduler as part of potentially blocking functions (such as sleeping, IO operations, etc.), and programmers do not need to annotate any expression with await.

tags: coroutine - rosettacode

Traversing nested lists with coroutines, Rosetta Code style

The task

The code

Ruby fibers (stackful, both asymmetric and symmetric)

Asymmetric

Symmetric

Lua threads (stackful, asymmetric)

Python generators (stackless, asymmetric)

Python coroutines (WTF?)

Python greenlet (stackful, symmetric)

Symmetric

Emulate asymmetric coroutine using the parent field

JavaScript generators (stackless, asymmetric)

JavaScript async/await (stackless, asymmetric, asynchronous)

JavaScript async generators (stackless, asymmetric, asynchronous)

Rust async/await (stackless, asymmetric, asynchronous)

Rust coroutine crate (stackful, asymmetric)

Rust context crate (stackful, symmetric)

C# iterators (stackless, asymmetric)

C# async/await (stackless, asymmetric, asynchronous)

C swapcontext (stackful, symmetric)

C++ coroutine (stackless, asymmetric, asynchronous)

C++ Boost.Coroutine2 (stackful, asymmetric)

C++ Boost.Context (stackful, symmetric)

Appendices

Why this task?

What are coroutines anyway?

Symmetric and asymmetric coroutines.

Stackful and stackless coroutines.

The swap-stack mechanism

Implementing swap-stack with compiler

Implementing swap-stack with interpreter

The Mu micro virtual machine

Decomposing a function into a state machine

Asynchronous programming (async/await) and coroutines

Consequence of being stackless

Emulate asymmetric coroutine using the `parent` field