Ownership in Zig and JavaScript
I've long felt that Zig makes sense to me because of my background as a JavaScript programmer. Despite how different the languages seem to be, I found myself using a similar organizational strategy in both languages.
Some Instructive Zig Code
I read a piece of code recently by a Zig beginner. It looked something like this.
pub const Config = struct {
alloc: std.mem.Allocator,
sets: std.ArrayList([]usize),
};
pub fn intersectAllSlices(config: *const Config, set: []usize) []usize {
var incoming = set;
for (config.sets.items) |config_set| {
incoming = intersectSlices(config.alloc, config_set, incoming);
}
return incoming;
}
pub fn intersectSlices(alloc: std.mem.Allocator, a: []usize, b: []usize) []usize {
var out = std.ArrayList(usize).initCapacity(alloc, a.len) catch @panic("OOM");
for (a) |item|
if (std.mem.containsAtLeastScalar(usize, b, 1, item))
out.appendAssumeCapacity(item);
return out.toOwnedSlice(alloc) catch @panic("OOM");
}The idea is simple, we have a list of sets (stored in our config object), and a new set. We want to find all items that are in all of the existing sets and the new set. We have two functions, one which implements set intersection (intersectSlices), and one which intersects our many sets (intersectAllSlices).
There are two issues with this code which should jump out to an intermediate Zig programmer. First, we leak memory (never freeing the ArrayLists created in intersectSlices), and second, all of the slices should be const.
JavaScript Port
Now, my first step when working on a tricky programming project is to convert the code to my "native language," JavaScript. And of course doing this gets rid of both of the mentioned issues. JavaScript doesn't care about const types and it has a garbage collector.
function intersectAllSlices(config, set) {
let incoming = set;
for (const config_set of config.sets)
incoming = intersectSlices(config_set, incoming);
return incoming;
}
function intersectSlices(a, b) {
let out = [];
for (const item of a)
if (b.includes(item))
out.push(item);
return out;
}The Third Issue
But what's interesting to me is that both programs share a third issue: ownership, or the lack thereof. Let's look at intersectAllSlices in the JavaScript again, and ignore intersectSlices.
function intersectAllSlices(config, set) {
let incoming = set;
for (const config_set of config.sets) {
incoming = // ...
}
return incoming;
}Normally, the assignment in the loop happens, and a new array is assigned to incoming and returned. The issue with this code is that if config.sets is empty, the inner assignment is skipped, and set is returned.
Now, of course, this isn't always an issue. It's not a bug, per se, but it's a footgun.
Since arrays in JavaScript are passed by reference, a caller doesn't know if the value returned from intersectAllSlices is safe to modify or not.
const set = [1, 2, 3];
const result = intersectAllSlices(config, set);
// Function that modifies result
result.unshift();
// Operation that uses set
console.log(set);Is this code correct? What does it print? We don't know. It could print [1, 2, 3] or [2, 3] depending on the value of config.
Obviously bugs arising from these sorts of double-reference footguns are extremely common in any language with pass-by-reference semantics (JavaScript, Ruby, Python, etc). And there's an entire class of functional languages that make all objects immutable to prevent this.
Introducing Ownership
You can eliminate the footgun in this Zig program by using constant slices, replacing []usize with []const usize. This is likely what the original author intended.
But I think there's a better solution than making everything immutable, and that's introducing an explicit owner for every piece of data.
As a web-developer, I'm not qualified to give an expansive definition of ownership, but the basic idea is:
- Every piece of data has one owner (normally a function)
- The owner is responsible for transferring ownership or freeing the memory before the program exits
- (Normally) only the owner is allowed to mutate the data (others can only read)
- (Normally) the "flow" of ownership should be possible to determine statically (i.e. by reading the code, without knowing anything about the data)
Let's look at the code again and try to trace the ownership.
function intersectSlices(a, b) {
let out = [];
//...
return out
}intersectSlices creates a new piece of data, so be default it is the owner. When out is returned, the ownership transfers to the calling function.
function intersectAllSlices(config, set) {
let incoming = set;
for (const config_set of config.sets)
incoming = intersectSlices(config_set, incoming);
return incoming;
}incoming is the return value of intersectSlices, and we just said it passes ownership to the caller, which is intersectAllSlices. So intersectAllSlices owns incoming. Before it returns, it transfers ownership of that value to the calling function. That's all great. We're only reading config, so we don't need ownership of that.
But look at set. Again, what happens to set depends on the runtime values, so we can't create ownership rules that are static. This is bad programming practice, but it doesn't stop us from analyzing the ownership. Since set isn't modified by the function, we don't need ownership of it. So we have to decide, more or less arbitrarily, whether it's owned or not, and that will effect whether or not the return value is owned by the caller. Here's the 2x2 breakdown of the options:
| set is referenced | set is owned
-----------------------------------|--------------------
config.sets | return value | return
is empty | isn't owned | value owned
--------------------------------------------------------
config.sets | return value | return value owned;
has values | owned | set is "dead"The bottom right is interesting. Because we don't return set, but we do own it, that means it's without an owner after the function returns. The first rule of ownership was that everything has an owner. This means that set must not exist after the function returns. In a garbage collected language we don't have a way of explicitly destroying it, but as long as it's not read or written after the function returns, the ownership rules are satisfied.
The other results follow from the ownership rules pretty simply. For example, top left: since we return sets, but don't own it, the return value cannot be owned. Bottom left: we own the return value of intersectSlices, so we have to return ownership of it.
You can see how, if we allow for runtime ownership rules, we can satisfy the ownership requirements. And we end up with a clear contract for how the function can be used safely. But it's difficult to reason about, and it imposes non-local runtime-dependent ownership constraints on the anything that depends on the return value of the function.
Fixing the Issue
Let's rewrite this code from scratch in Zig, taking into account ownership.
We immediately have to decide how to handle ownership. We two options; taking ownership of set, and mutating it. Or, referencing set, creating a new (owned) value, and returning that. Since we're porting the original code, and the JavaScript code, I'm going to go with option 2, creating a new value.
pub fn intersectAllSlices2(config: *const Config, set: []const usize) []usizeYou can see, because the caller owns the new returned list, the return type is a []usize, but the passed-in set is const, since it's only referenced.
Now let's look at the whole function:
pub fn intersectAllSlices2(config: *const Config, set: []const usize) []usize {
const copy = config.alloc.dupe(usize, set) catch @panic("OOM");
var out: std.ArrayList(usize) = .fromOwnedSlice(copy);
for (config.sets.items) |config_set| {
removeNonexistant(usize, &out, config_set);
}
return out.toOwnedSlice(config.alloc) catch @panic("OOM");
}We copy set at the beginning, so we now own out and can use it as our return value, passing ownership to the caller with toOwnedSlice.
But what happened to intersectSlices? Well, intersect and removeNonexistant are really the same operation, but the later is going to mutate the set by removing everything not in the second set. If you're familiar with Zig, this change should make sense—Zig doesn't like unnecessary allocations. Instead of doing out = // newly allocated set, let's skip the allocations and just modify it.
out is owned by intersectAllSlices2, but it's okay for removeNonexistant to modify it. You could think about this in a couple of different ways: as an "exclusive mutable reference", or as ownership temporarily transferring to removeNonexistant and then back (even though the value doesn't move), or as removeNonexistant being a helper function that does the mutation but doesn't participate in the ownership system itself. But the jist is that it's okay because it has "permission" from the owner.
pub fn removeNonexistant(T: type, list: *std.ArrayList(T), to_keep: []const T) void {
var i: usize = 0;
while (i < list.items.len) {
if (std.mem.findScalar(T, to_keep, list.items[i])) |_| {
i += 1;
} else {
_ = list.swapRemove(i);
}
}
}There is a nicer algorithm for O(n) filtering which preserves order, but we don't need stable ordering since these are sets.
Because removeNonexistant doesn't allocate, that means the only allocation in our program is the initial copy, which is returned as owned memory to the caller. This means the caller needs to free it, and there's no more memory leak in this code.
Conclusion
This version of the code has solved all of our problems. We don't leak memory, but we didn't complicate the memory management either, because we only do one allocation and return ownership of the memory to the caller.
The even more Zig-like thing to do is probably to take ownership of set and mutate it. This would keep the algorithm implementation separate from a possible copy.set is const because it's only read. And we have ownership semantics that are clear to the reader.
Translating this code back in into JavaScript gives us an implementation that avoids potential aliasing issues.
function intersectAllSlices2(config, set) {
const out = Array.from(set);
for (let config_set of config.sets) {
removeNonexistant(out, config_set);
}
return out;
}
function removeNonexistant(list, to_keep) {
let i = 0;
while (i < list.length) {
if (to_keep.includes(list[i])) {
i += 1
} else {
list.splice(i, 1);
}
}
}In modern JavaScript it's not an idiomatic implementation. But it's an improvement on the original version.
But there's another advantage to this code that I haven't explicitly mentioned: performance. removeNonexistant doesn't allocate intermediate values anymore, meaning we do O(1) allocations instead of O(n). In Zig this is a big deal. The original author's implementation was doubtless inspired by the functional language approach: it closely matches an accumulator/reduce pattern, and the likely intention was that all of these functions would create new values in order to avoid mutating.
Even in JavaScript, the functional-programming-approach is more idiomatic, I would argue. But JavaScript's arrays are pass-by-reference, and the language doesn't have a type-system-level way to prevent mutations, so it's common to end up using libraries like Immer for immutability, or reasoning about ownership.
Reasoning about ownership is a very powerful tool. In Zig it lets you confidently write mutating code while avoiding memory leaks or use after free bugs. In JavaScript, it lets you avoid subtle, but potentially-disastrous, aliasing bugs.
People say manual memory management is very intimidating, but utilizing a strict ownership model makes it straightforward, and helps you avoid other problems at the same time. And you may already have practice with ownership modeling, without realizing it, if you have a lot of experience in JavaScript.
Post Script
The original author chose to useArrayListas the set datastructure, but there are other options. Zig has aBitSet, backed by a single integer, which is appropriate if you can constrain your values between 0 and say 128.ArrayListis the right choice if you have a wide range of values but know that you never have more than a small number at a given time. If you have an unbounded number of values, Zig's HashMap can function as a HashSet by simply not using thevaluefield. That would give you O(1) presence checks.