Matthias Portzel

Ownership in Zig and JavaScript

I've long felt that Zig makes sense to me as a result of my background as a JavaScript programmer, despite how different the languages are. But I was never able to put my finger on exactly why (both have C-like syntax, but it's more than that).

Some Instructive Zig Code

I read a piece of code recently by a Zig beginner. It looked like this (editorialized for the sake of discussion):

pub const Config = struct {
    alloc: std.mem.Allocator,
    sets: std.ArrayList([]usize),
};

pub fn intersectAllSlices(config: *const Config, set: []usize) []usize {
    var incoming = set;

    for (config.sets.items) |config_set| {
        incoming = intersectSlices(config.alloc, config_set, incoming);
    }

    return incoming;
}

pub fn intersectSlices(alloc: std.mem.Allocator, a: []usize, b: []usize) []usize {
    var out = std.ArrayList(usize).initCapacity(alloc, a.len) catch @panic("OOM");
    for (a) |item|
        if (std.mem.containsAtLeastScalar(usize, b, 1, item))
            out.appendAssumeCapacity(item);

    return out.toOwnedSlice(alloc) catch @panic("OOM");
}

The idea is simple, we have a list of sets (stored in our config object), and a new set. We want to find all items that are in all of the existing sets and the new set. We have two functions, one which implements set intersection (intersectSlices), and one which intersects our many sets (intersectAllSlices).

There are two issues which should jump out to an intermediate Zig programmer. First, we leak memory (never freeing the ArrayLists created in intersectSlices), and second, all of the slices should be const.

JavaScript Port

Now, my first step when working on any programming project is to convert the code to JavaScript, as that's easier for me to work with. And of course doing this gets rid of both of the mentioned issues. JavaScript doesn't care about const types and it has a garbage collector.

function intersectAllSlices(config, set) {
    let incoming = set;

    for (const config_set of config.sets)
        incoming = intersectSlices(config_set, incoming);

    return incoming;
}

function intersectSlices(a, b) {
    let out = [];
    for (const item of a)
        if (b.includes(item))
          out.push(item);

    return out;
}

The Third Issue

But what's interesting to me is that both programs share a third issue: ownership, or the lack thereof. Let's look at intersectAllSlices in the JavaScript again, and ignore intersectSlices.

function intersectAllSlices(config, set) {
    let incoming = set;

    for (const config_set of config.sets) {
        incoming = // ...
    }

    return incoming;
}

Normally, the assignment in the loop happens, and a new array is assigned to incoming and returned. The issue with this code is that if config.set is empty, the inner assignment is skipped, and set is returned.

Now, of course, this isn't always an issue. It's not a bug, per se, but it's a footgun.

Since arrays in JavaScript are passed by reference, a caller doesn't know if the value returned from intersectAllSlices is safe to modify or not.

const set = [1, 2, 3];
const result = intersectAllSlices(config, set);

// Function that modifies result
result.unshift();
// Operation that uses set
console.log(set);

Is this code correct? What does it print? We don't know. It could print [1, 2, 3] or [2, 3] depending on the value of config.

Introducing Ownership

Obviously bugs arising from these sorts of double-reference footguns are exremely common in languages like JavaScript and Ruby, and there's an entire class of languages that make all objects immutable to prevent this. You can eliminate the footgun in this Zig program by doing what I mentioned earlier: replacing []usize with []const usize. (Although it doesn't fix the memory leak.)

But I think there's a better solution, and that's introducing an explit owner for every piece of data.

I'm not qualified to give an expansive definition of ownership, but the basic idea is:

Every piece of data has one owner (normally a function)
The owner is responsible for transferring ownership or freeing the memory before the program exits
(Normally) only the owner is allowed to mutate the data (others can only read)
(Normally) the "flow" of ownership should be possible to determine statically (i.e. by reading the code at compile time, without knowing anything about the data)

Let's look at the code again and try to trace the ownership.

function intersectSlices(a, b) {
    let out = [];
    //...
    return out
}

intersectSlices creates a new peice of data. The owner of out is intersectSlices. When out is returned, the ownership transfers to the calling function.

function intersectAllSlices(config, set) {
    let incoming = set;

    for (const config_set of config.sets)
        incoming = intersectSlices(config_set, incoming);

    return incoming;
}

Looking at this function, incoming is the result of intersectSlices, which we just said is owned by this function. Before this function return, we transfer ownership of that value to the calling function. That's all great. We're only reading config, so we don't need ownership of that.

But look at set. Again, what happens to set depends on the runtime values, so we can't create ownership rules that are static. This is bad programming practice, but it doesn't stop us from analyzing the ownership rules. Since set isn't modified by the function, we don't need ownership of it. So we have to decide, more or less arbitrarily, whether its owned or not, and that will effect whether or not the return value is owned by the caller. Here's the 2x2 breakdown of the options:

             |  set is referenced  |  set is owned      
-----------------------------------|--------------------
 config.sets | return value        |  return            
  is empty   |  isn't owned        |   value owned      
--------------------------------------------------------
 config.sets | return value        |  return value owned
  has values |    owned            |   set is "dead"

The bottom right is interesting. Because we don't return set, but we do own it, that means it's without an owner after the function returns. The first rule of ownership was that everything has an owner. This means that set must not exist after the function returns. In a garbage collected language we don't have a way of explicitly destroying it, but as long as it's not read or written after the function returns, the ownership rules are satisfied.

The other results should be pretty self-explanatory. For example, top left, since we return `sets, but don't own it, the return value cannot be owned. Bottom left, we own the return value of intersectSlices, so we have to return ownership of it.

You can see how, if we allow for runtime ownership rules, we can satisfy the ownership requirements. And we end up with a clear contract for how the function can be used safely. But it's difficult to reason about, and it imposes non-local runtime-dependant ownership constraints on the anything that depends on the return value of the function.

Fixing the Issue

Let's rewrite this code from scratch in Zig, taking into account ownership.

We immediately have to decide how to handle ownership. We two options; taking ownership of set, and mutating it. Or, referencing set, creating a new (owned) value, and returning that. Since we're porting the original code, and the JavaScript code, I'm going to go with option 2, creating a new value.

pub fn intersectAllSlices(config: *const Config, set: []const usize) []usize

You can see, because the caller owns the new returned list, the return type is a []size, but the passed-in set is const, since it's only referenced.

Now let's look at the whole function:

pub fn intersectAllSlices2(config: *const Config, set: []const usize) []usize {
    const copy = config.alloc.dupe(usize, set) catch @panic("OOM");
    var out: std.ArrayList(usize) = .fromOwnedSlice(copy);

    for (config.sets.items) |config_set| {
        removeNonexistant(usize, &out, config_set);
    }

    return out.toOwnedSlice(config.alloc) catch @panic("OOM");
}

We copy set at the beginning, so we now own out and can use it as our return value, passing ownership to the caller with toOwnedSlice.

But what happened to intersectSlices? Well, intersect and removeNonexistant are really the same thing if you think about, but the later is going to mutate the set by removing everything not in the second set. If you're familiar with Zig, this change should make sense—Zig doesn't like unnecessary allocations. And in this case, we have ownership of out (so we can modify it freely), and we don't need the old value. Instead of doing out = // newly allocated set, let's skip the allocations and just modify it.

pub fn removeNonexistant(T: type, list: *std.ArrayList(T), to_keep: []const T) void {
    var i: usize = 0;
    while (i < list.items.len) {
        if (std.mem.findScalar(T, to_keep, list.items[i])) |_| {
            i += 1;
        } else {
            _ = list.swapRemove(i);
        }
    }
}

There is a nicer algorithm for O(n) filtering which preserves order (unlike this one), but it's a little more confusing on first reading.

Conclusion

This version of the code has solved all of our problems. We don't leak memory, but we didn't didn't complicate the memory mangement either, because we only do one allocation and return ownership of the memory to the caller.

The even more Zig-like thing to do is probably to take ownership of set and mutate it. It would save a copy in the function, keeping the algorithm implementation and the allocations/memory management separate.

set is const because it's only read. And we have ownership semantics that are clear to the reader.

Translating this code back in into, JavaScript gives us an implimentation that avoids potentional aliasing issues.

function intersectAllSlices2(config, set) {
    const out = Array.from(set);

    for (let config_set of config.sets) {
        removeNonexistant(out, config_set);
    }

    return out;
}


function removeNonexistant(list, to_keep) {
    let i = 0;
    while (i < list.length) {
        if (to_keep.includes(list[i])) {
            i += 1
        } else {
            list.splice(i, 1);
        }
    }
}

In modern JavaScript it's not an idiomatic implimentation. But it's an improvement on the original version with the aliasing issue.

But there's another advantage to this code that I haven't explictily mentioned: performance. removeNonexistant doesn't allocate intermediate values anymore, meaning we do O(1) allocations instead of O(n). In Zig this is a big deal. The original author's implimentation was doubtless inspired by the functional language approach: it closely matches an accumulator/reduce pattern, and doubtless the intention was that all of these functions would create new values in order to avoid mutating. This type of code is very common in functional languages. Most functions implicitly return copies, and in many of these languages arrays are not mutable at all.

Even in JavaScript, the functional-programming-approach is more idiomatic, I would argue. But since JavaScript's arrays are pass-by-reference, and the language doesn't have a type-system-level way to prevent mutations, JavaScript programmers frequently using libraries like Immer for immutability, or Lodash for point-free transformations. But sooner or later the limitations of these libraries are reached, and you need to reason about what code is allowed to mutate and when objects are allowed to be read, and ownership is the best way of doing that.

We quite literally talk about what React component "owns" a particular peice of state. It's really quite a similar problem to manual memory managment in Zig.

Post Script

It's a little bit of a peculiar choice by the original author to use ArrayList as the set datastructure. Zig has a BitSet, backed by a single integer, which is appropraite if you can constrain your values between 0 and say 128. ArrayList could be the right choice if you have a wide range of values but know that you never have more than a dozen or so at a given time. If you have an unbounded number of values, Zig's HashMap can function as a HashSet by simply not using the value field.