<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="http://garymm.org/feed.xml" rel="self" type="application/atom+xml" /><link href="http://garymm.org/" rel="alternate" type="text/html" /><updated>2026-02-06T20:43:24-08:00</updated><id>http://garymm.org/feed.xml</id><entry><title type="html">Testing software in the era of coding agents</title><link href="http://garymm.org/blog/2026/02/06/testing-software-coding-agents/" rel="alternate" type="text/html" title="Testing software in the era of coding agents" /><published>2026-02-06T00:00:00-08:00</published><updated>2026-02-06T00:00:00-08:00</updated><id>http://garymm.org/blog/2026/02/06/testing-software-coding-agents</id><content type="html" xml:base="http://garymm.org/blog/2026/02/06/testing-software-coding-agents/"><![CDATA[<p>What parts of my software should be tested? And how? And how do coding agents (e.g. Claude Code) change things? This is my attempt to succinctly explain how I think about these questions at the beginning of 2026 (software development is changing so quickly I feel compelled to note the date. This may all be obsolete soon).</p>

<h2 id="static-analysis--testing">Static analysis ⊂ testing</h2>

<p>For the purposes of this discussion, I’m going to include static analysis in the term “testing” for brevity rather than writing “testing and static analysis”. Static analysis includes any checks that don’t have to execute the code, including the checks performed by compilers and linters.</p>

<h2 id="why-test-software">Why test software</h2>

<p>If it’s not tested, you have much less evidence that it works, where “it” is any particular combination of code, inputs, runtime environment, and assertions. So to the extent you care about the software working, you should test it.</p>

<p>Most software is tested manually when it’s first created. What’s wrong with manual testing?
First, it’s slow! While the time cost of all testing grows with the number of changes being tested and the number of checks, the slope of the curve matters. Automated tests have a higher up-front cost of writing the test, but then a drastically lower marginal cost each time they’re run. The cheaper and faster the tests are, the faster one can make changes (while maintaining the same confidence in correctness).</p>
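<p>To make the slope argument concrete, here is a toy break-even calculation. All the costs are made up for illustration; the point is the shape, not the numbers:</p>

```python
# Toy break-even model: when does writing an automated test beat
# re-running a manual check? All numbers are made up for illustration.
manual_minutes_per_run = 10      # assumed cost of one manual check
automated_upfront_minutes = 60   # assumed cost of writing the test
automated_minutes_per_run = 0.1  # assumed cost of one automated run

def total_cost(runs, upfront, per_run):
    return upfront + runs * per_run

# First run count at which automation is cheaper overall.
break_even = next(
    n for n in range(1, 1000)
    if total_cost(n, automated_upfront_minutes, automated_minutes_per_run)
    < total_cost(n, 0, manual_minutes_per_run)
)
print(break_even)  # 7 with these numbers
```

<p>With these numbers the test pays for itself by the seventh run; once the change rate is high, the marginal cost dominates and the up-front cost is noise.</p>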

<p>The above was true before coding agents, but with coding agents, there are a few additional important considerations. First, the human time needed to actually write code is falling quickly (because you can have coding agents do it), but if checking correctness is slow (e.g. requires manual testing), that will become the bottleneck. Second, coding agents have a much higher chance of success if they can get quick feedback while they work. Third, manual tests often require more interpretation to determine if the system is working as intended, which is error-prone. This is especially true for coding agents that lack broader context and are somewhat prone to <a href="https://en.wikipedia.org/wiki/Reward_hacking">reward hacking</a>. Agents are much less likely to misinterpret an assertion failure than something less well defined, and reward hacking is much more visible if it shows up as disabling an assertion in code rather than an optimistic interpretation of a manual test result.</p>

<h2 id="how-much-to-invest-in-testing">How much to invest in testing</h2>

<p>There are two main things that determine how much it makes sense to invest in testing. First, how much does it matter that the system is correct (or performant)? At one extreme we have code that only needs to work well enough to explore an idea (e.g., for learning about an algorithm). Investing a lot in automated testing for this type of thing is wasteful. At the other extreme is code that is doing critical work everywhere for everyone (e.g., OpenSSL).</p>

<p>Second, how often are the code, its inputs, or its environment changing? If the rate of change is low, the total cost of manual testing is low. If the code is changed frequently (including changes to its dependencies), manual testing is wasteful. As software systems get larger, they typically accrue more interdependencies between modules, which greatly amplifies the effective change rate.</p>

<h2 id="how-to-invest-your-testing-budget">How to invest your testing budget</h2>

<p>Build a test pyramid! The names of the layers matter less than the shape: in general you should have more isolated, fast tests, and fewer slow, integrated tests.</p>

<p><img class="wrap" src="/generated/2026-02-06-testing-software-coding-agents/test-pyramid-466-945f290ca.png" alt="Test pyramid diagram showing more unit tests at the bottom, fewer integration tests in the middle, and even fewer end-to-end tests at the top" srcset="/generated/2026-02-06-testing-software-coding-agents/test-pyramid-400-89c7e4ef1.webp 400w, /generated/2026-02-06-testing-software-coding-agents/test-pyramid-466-89c7e4ef1.webp 466w" /></p>

<p>Quoting “<a href="https://martinfowler.com/articles/practical-test-pyramid.html">The Practical Test Pyramid</a>”:
“Your best bet is to remember two things from Cohn’s original test pyramid:</p>

<ol>
  <li>Write tests with different granularity</li>
  <li>The more high-level you get the fewer tests you should have</li>
</ol>

<p>Stick to the pyramid shape to come up with a healthy, fast and maintainable test suite: Write lots of small and fast unit tests. Write some more coarse-grained tests and very few high-level tests that test your application from end to end.”</p>

<p>Again I think this becomes even more important with coding agents. They may have less understanding of which properties of the system are important, and they will be much more productive if they can get feedback from static analysis tools or unit tests that are fast, specific, and not flaky.</p>

<h2 id="how-to-integrate-testing-into-the-development-process">How to integrate testing into the development process</h2>

<h3 id="when-fixing-bugs">When fixing bugs</h3>

<p>A bug or performance problem is very strong evidence that existing testing is insufficient! If you’re fixing a bug or performance regression, this is a great time to practice test-driven development by following three steps:</p>

<ol>
  <li>Write test. Verify it fails.</li>
  <li>Fix bug.</li>
  <li>Run test. Verify it passes.</li>
</ol>

<p>This guarantees that you’ve actually understood and fixed the problem, and can help avoid regressions.</p>
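<p>A minimal sketch of those three steps (the function and its bug are hypothetical):</p>

```python
# Hypothetical bug: median() used to crash or return the wrong value
# for even-length inputs.
def median(xs):
    xs = sorted(xs)
    n = len(xs)
    if n % 2:
        return xs[n // 2]
    return (xs[n // 2 - 1] + xs[n // 2]) / 2  # step 2: the fix

def test_median_even_length():
    # Step 1: this assertion failed before the fix existed.
    # Step 3: after the fix, the same test passes.
    assert median([4, 1, 3, 2]) == 2.5

test_median_even_length()
```

<p>The test now guards the fix forever at near-zero marginal cost.</p>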

<p>However, people frequently find it’s quite difficult to write tests that reproduce bugs or performance regressions. The code often needs to be refactored to make it easier to test. This brings us to…</p>

<h3 id="when-writing-new-code">When writing new code</h3>

<p>If you’re writing new code that you expect to warrant testing (i.e. you care enough about its correctness or it’s going to change a lot), add tests from the beginning. This will naturally encourage you (or your AI agent) to design it in a way that makes it easy to test! You might find the “<a href="https://testing.googleblog.com/2025/10/simplify-your-code-functional-core.html">functional core, imperative shell</a>” pattern useful here.</p>
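<p>A tiny sketch of that pattern (the names are illustrative, not taken from the linked post): the pure core is trivially unit-testable, and the shell that does the I/O stays thin.</p>

```python
def summarize(lines):
    """Functional core: pure, so a unit test needs no filesystem."""
    words = sum(len(line.split()) for line in lines)
    return f"{len(lines)} lines, {words} words"

def main(path):
    """Imperative shell: does the I/O, delegates the logic."""
    with open(path) as f:
        print(summarize(f.readlines()))

# The core is tested directly, with no setup or teardown:
assert summarize(["a b", "c"]) == "2 lines, 3 words"
```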

<h3 id="when-refactoring">When refactoring</h3>

<p>If you’re making a big change to the system that <em>shouldn’t</em> affect its output but you’re nervous and you feel like you should probably do a lot of manual testing to make sure, that’s a good sign that more automated tests may be needed.</p>]]></content><author><name>garymm</name></author><category term="programming" /><category term="software engineering" /><summary type="html"><![CDATA[What parts of my software should be tested? And how? And how do coding agents (e.g. Claude Code) change things? This is my attempt to succinctly explain how I think about these questions at the beginning of 2026 (software development is changing so quickly I feel compelled to note the date. This may all be obsolete soon).]]></summary></entry><entry><title type="html">Earl: a framework for scalable reinforcement learning research</title><link href="http://garymm.org/blog/2025/03/03/earl/" rel="alternate" type="text/html" title="Earl: a framework for scalable reinforcement learning research" /><published>2025-03-03T00:00:00-08:00</published><updated>2025-03-03T00:00:00-08:00</updated><id>http://garymm.org/blog/2025/03/03/earl</id><content type="html" xml:base="http://garymm.org/blog/2025/03/03/earl/"><![CDATA[<p>In this post I will briefly describe <a href="https://github.com/garymm/earl">Earl</a>, a reinforcement learning (RL) framework I wrote that enables scalable distributed training across multiple devices, and discuss some of the things I learned along the way.</p>

<p>Earl implements the two architectures described in “<a href="https://arxiv.org/abs/2104.06272">Podracer architectures for scalable Reinforcement Learning</a>”, which were used at DeepMind to scale training to very large batch sizes across many chips. Note these are not neural network architectures, but distributed RL architectures that can be used to train models that internally may use any neural network architecture. To prove it is usable, I used Earl to <a href="https://github.com/garymm/earl/tree/master/earl/agents/r2d2">implement the R2D2</a> algorithm as described in another DeepMind paper “<a href="https://openreview.net/forum?id=r1lyTjAqYX">Recurrent Experience Replay In Distributed Reinforcement Learning</a>”.</p>

<h2 id="background">Background</h2>

<p>To provide context, I’ll briefly summarize the Podracer architectures paper. If you know it, feel free to skip this.</p>

<p>In contrast to other machine learning paradigms, online RL involves an agent and an environment, and the training data is generated on the fly from their interactions. The paper describes two architectures: Anakin and Sebulba<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>. Anakin is used when an environment is compatible with jax.jit, and Sebulba is used otherwise. In both of the architectures, the agent is implemented in JAX and is compatible with jax.jit. If you’re not familiar with JAX, see <a href="https://www.garymm.org/blog/2024/09/08/jaxwhat/">my introduction</a>. Basically, code that is run under jax.jit is optimized by a compiler and run on any supported device (e.g. GPU) without further involving the Python interpreter.</p>

<p>In Anakin, one can have the entire training loop (agent + environment interaction, loss function and optimization) happen under jax.jit and thus run on a device (e.g. GPU) without going back to the Python interpreter. In terms of writing a performant training loop, in some ways this is even easier to deal with than normal (supervised) machine learning since one does not need to copy any data from the host to the accelerator. Scaling this across multiple devices is trivial using JAX.</p>
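<p>As a toy illustration of what “the entire training loop under jax.jit” means (this is not Earl’s API; the environment, policy, and update rule are all stand-ins), the whole act/step/update cycle can live inside one jitted lax.scan:</p>

```python
from functools import partial

import jax
import jax.numpy as jnp

def env_step(env_state, action):
    # Stand-in environment dynamics and reward.
    new_state = env_state + action
    return new_state, -jnp.abs(new_state)

@partial(jax.jit, static_argnames="steps")
def train_cycle(params, env_state, steps=100):
    def body(carry, _):
        params, env_state = carry
        action = jnp.tanh(params * env_state)  # stand-in policy
        env_state, reward = env_step(env_state, action)
        params = params + 0.01 * reward        # stand-in "update"
        return (params, env_state), reward

    (params, env_state), rewards = jax.lax.scan(
        body, (params, env_state), None, length=steps)
    return params, env_state, rewards

params, env_state, rewards = train_cycle(jnp.float32(0.1), jnp.float32(1.0))
```

<p>Nothing inside the loop returns to the Python interpreter, so the compiler can keep all data and computation on the device for the whole cycle.</p>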

<p>Below is a figure I created that is analogous to the paper’s figure 3 (see below) that illustrates Anakin-style RL. Notice that there is nothing running on the CPU! This figure may be misleading because the arrows don’t necessarily signify data being copied. It’s just data that is produced by one function being an argument to another function.</p>

<p><img class="wrap" src="/generated/2025-03-03-earl/anakin-800-16e671698.png" alt="anakin RL training architecture diagram" srcset="/generated/2025-03-03-earl/anakin-400-d1f7d320f.webp 400w, /generated/2025-03-03-earl/anakin-600-d1f7d320f.webp 600w, /generated/2025-03-03-earl/anakin-800-d1f7d320f.webp 800w, /generated/2025-03-03-earl/anakin-806-d1f7d320f.webp 806w" /></p>

<p>Although jax.jit-compatible environments are gaining more adoption in research, there are still many environments that can’t run under jax.jit that researchers care about. The Podracer solution to training on these at scale is called Sebulba. Sebulba involves splitting agents into actor and learner as shown in Figure 3 from the paper. Note that in this figure the arrows <em>do</em> signify data copies.</p>

<p><img class="wrap" src="/generated/2025-03-03-earl/sebulba-800-1b8f2b66e.png" alt="sebulba RL training architecture diagram" srcset="/generated/2025-03-03-earl/sebulba-400-edc91e3d3.webp 400w, /generated/2025-03-03-earl/sebulba-600-edc91e3d3.webp 600w, /generated/2025-03-03-earl/sebulba-800-edc91e3d3.webp 800w, /generated/2025-03-03-earl/sebulba-1000-edc91e3d3.webp 1000w" /></p>

<p>One of my main goals with Earl was to have a single agent implementation that could easily be run in either architecture. This is in contrast to what the team behind the Podracers paper did; the paper suggests they implemented agents twice, once for Anakin and once for Sebulba.</p>

<p>The “pod” in “Podracers” is an allusion to a collection of TPUs that are all connected with high bandwidth. Later in this post I will discuss what advantages TPUs actually provide.</p>

<h2 id="gymnax-loop-earls-anakin">Gymnax Loop: Earl’s Anakin</h2>

<p>Earl’s implementation of Anakin is GymnaxLoop. <a href="https://github.com/RobertTLange/gymnax">Gymnax</a> is a collection of RL environments implemented in JAX with a common interface, and Earl adopted that interface because it seemed more widely used than the alternatives. The GymnaxLoop implementation is mostly straightforward, so here I only discuss some of the trickier problems I solved.</p>

<h3 id="avoiding-recompilation">Avoiding recompilation</h3>

<p>The first time a jax.jit function is run, it is compiled, which is slow. Unwanted recompilations are a performance foot-gun, so by default Earl will fail if the code is recompiled. When I enabled this failure, I learned that my Gymnax Loop was recompiling the main loop (act and learn) because the types of some part of the environment state were changing. After digging into it (made difficult due to a <a href="https://github.com/jax-ml/jax/issues/23302">JAX bug</a> that I reported) I discovered that the change was that the initial state (returned by env.reset()) had weak_type=True on some arrays, but calls to env.step() changed the weak_type to False. GymnaxLoop fixes this by setting weak_type=False on all arrays in the environment state before running. This avoids recompilation and thus speeds up training significantly.</p>
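<p>A small sketch of the weak_type distinction (illustrative, not GymnaxLoop’s actual code): arrays built from bare Python scalars carry weak_type=True, arrays with an explicit dtype don’t, and a jitted function retraces when the flag flips even though shape and dtype match.</p>

```python
import jax.numpy as jnp

a = jnp.asarray(1.0)                  # from a Python float: weak_type=True
b = jnp.zeros((), dtype=jnp.float32)  # explicit dtype: weak_type=False

# Same shape and dtype, but different weak_type, so a jitted function
# traced on `a` will recompile when called with `b`.
print(a.dtype == b.dtype, a.aval.weak_type, b.aval.weak_type)  # True True False
```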

<h2 id="gymnasium-loop-earls-sebulba">Gymnasium Loop: Earl’s Sebulba</h2>

<p>Earl’s implementation of Sebulba is GymnasiumLoop. <a href="https://gymnasium.farama.org/index.html">Gymnasium</a> is a widely-used interface for RL environments, which generally are not compatible with jax.jit. The GymnasiumLoop has a much more complex design, was much trickier to implement correctly, and was harder to optimize for performance.</p>

<p>Here’s a diagram showing the system architecture. Sorry the text is a little small. You can open the image in a new window to zoom in.</p>

<p><img class="wrap" src="/generated/2025-03-03-earl/gymnasium-loop-800-ced76e552.png" alt="Gymnasium Loop system diagram" srcset="/generated/2025-03-03-earl/gymnasium-loop-400-bb341e4c4.webp 400w, /generated/2025-03-03-earl/gymnasium-loop-600-bb341e4c4.webp 600w, /generated/2025-03-03-earl/gymnasium-loop-800-bb341e4c4.webp 800w, /generated/2025-03-03-earl/gymnasium-loop-1000-bb341e4c4.webp 1000w" /></p>

<p>And here are results from a test showing linear scaling on up to 6 learner devices (TPU v2 cores):</p>

<p><img class="wrap" src="/generated/2025-03-03-earl/gymnasium_loop_scaling-563-9129f04da.png" alt="Gymnasium Loop scaling graph" srcset="/generated/2025-03-03-earl/gymnasium_loop_scaling-400-d0264f2f1.webp 400w, /generated/2025-03-03-earl/gymnasium_loop_scaling-563-d0264f2f1.webp 563w" /></p>

<p>Before going further into details, why is all this complexity needed in the first place? That is never explicitly addressed in the Podracers paper. I think the key thing is that in a naive loop of env.step(), agent.act(), the device (GPU) will be idle during env.step() and the CPU will be idle during agent.act(). So we can get much better throughput by double buffering: have one batch of actions being computed by the agent at the same time that a batch of observations is being computed by the environment. But to take advantage of this double buffering, the learner must be able to learn from batches of trajectories that come from different sets of environments and are delivered out of order. That basically implies an actor-learner split, and once the agent is split that way, you can get further throughput gains by scaling the number of actors and learners independently. And that’s the architecture: separate sets of actors and learners, communicating asynchronously, scaled independently.</p>

<p>OK, now some details.</p>

<h3 id="agent-state-organization">Agent state organization</h3>

<p>A key design challenge was creating a flexible agent architecture that could work efficiently in both Anakin and Sebulba paradigms. Earl has two main base classes: <a href="https://github.com/garymm/earl/blob/496737c4f3172caa151477d47280b3a172525138/earl/core.py#L57">AgentState</a> and <a href="https://github.com/garymm/earl/blob/496737c4f3172caa151477d47280b3a172525138/earl/core.py#L233">Agent</a>. These are structured to enable Sebulba-style training while attempting to leave the user lots of freedom (they’re also used for Anakin-style training but they’re way too complex if that’s the only thing you need).</p>

<p>AgentState has the following fields:</p>

<ul>
  <li>Actor. This is read and written by the actor. It is also read by the learner when calculating the loss. In agents that use recurrent networks, this includes the recurrent hidden states.</li>
  <li>Nets. This holds the neural networks. It is read by the actor, and read and written by the learner. Anything that needs a gradient computed needs to be in the networks.</li>
  <li>Opt. Anything other than nets that also needs to be updated when optimizing (i.e. updating the networks). This is where optimizer state belongs.</li>
  <li>Experience. This is state based on the trajectories accumulated by actors and sent to the learners. For agents that use experience replay, this contains replay buffers.</li>
</ul>
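<p>In sketch form (a simplified stand-in, not Earl’s actual class, which lives in core.py):</p>

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

Actor = TypeVar("Actor")
Nets = TypeVar("Nets")
Opt = TypeVar("Opt")
Experience = TypeVar("Experience")

@dataclass
class AgentState(Generic[Actor, Nets, Opt, Experience]):
    actor: Actor            # read/written by the actor; read by the learner's loss
    nets: Nets              # read by the actor; read and written by the learner
    opt: Opt                # optimizer state, updated alongside nets
    experience: Experience  # replay buffers etc., built from actor trajectories
```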

<p>And the key methods in Agent are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">act(self, actor_state: _ActorState, nets: _Networks, env_step: EnvStep) -&gt; ActionAndState[_ActorState]</code></li>
  <li><code class="language-plaintext highlighter-rouge">update_experience(self,  experience_state: _ExperienceState,  actor_state_pre: _ActorState,   actor_state_post: _ActorState,  trajectory: EnvStep) -&gt; _ExperienceState</code></li>
  <li><code class="language-plaintext highlighter-rouge">partition_for_grad(self, nets: _Networks) -&gt; tuple[_Networks, _Networks]</code></li>
  <li><code class="language-plaintext highlighter-rouge">loss(self, nets: _Networks, opt_state: _OptState, experience_state: _ExperienceState) -&gt; tuple[Scalar, _ExperienceState]</code></li>
  <li><code class="language-plaintext highlighter-rouge">optimize_from_grads(self, nets: _Networks, opt_state: _OptState, nets_grads: PyTree) -&gt; tuple[_Networks, _OptState]</code></li>
  <li><code class="language-plaintext highlighter-rouge">shard_actor_state(self, actor_state: _ActorState, learner_devices: Sequence[jax.Device]) -&gt; _ActorState</code></li>
</ul>

<p>The method signatures and AgentState structure force algorithms to be implemented such that GymnasiumLoop can run any agent in a scalable manner.</p>

<h3 id="implicit-double-buffering">Implicit double buffering</h3>

<p>When I first thought about double buffering I thought I would write code that used two CUDA streams to overlap work. I was surprised to learn that JAX does not expose CUDA streams or any similar abstraction. Upon re-reading the Podracers paper, I noticed they wrote:</p>

<p><em>To make efficient use of the actor cores, it is essential that while a Python thread is stepping a batch of environments, the corresponding TPU core is not idle. This is achieved by creating multiple Python threads per actor core, each with its own batched environment. The threads alternate in using the same actor core, without manual synchronization.</em></p>

<p>So I tried just having multiple threads use the same device, and lo and behold I got a huge speedup! Looking at a profile in Nvidia Nsight Systems revealed that under the hood, JAX had analyzed the computations coming in from the different threads, determined they were independent, and scheduled them on separate CUDA streams (really separate CUDA graphs). This is in contrast to PyTorch, which by default puts all work on a single stream and requires the user to specify another stream if desired.</p>
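<p>The pattern is easy to reproduce in miniature (a sketch, not Earl’s loop code): two Python threads each dispatch work to the same jitted function on the shared device, with no manual synchronization, leaving JAX free to overlap their execution.</p>

```python
import threading

import jax
import jax.numpy as jnp
import numpy as np

@jax.jit
def act(params, obs):
    # Stand-in for agent.act: one batched forward pass.
    return jnp.tanh(obs @ params)

def actor_thread(params, out, idx, seed):
    rng = np.random.default_rng(seed)
    # Each thread steps its own batched "environment" (simulated here)
    # and dispatches act() on the shared device.
    obs = rng.standard_normal((64, 32)).astype(np.float32)
    out[idx] = np.asarray(act(params, obs))

params = np.zeros((32, 8), dtype=np.float32)
out = [None, None]
threads = [threading.Thread(target=actor_thread, args=(params, out, i, i))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```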

<p>Below you can see the Nsight Systems UI showing the different CUDA graphs at the top and the agent.act() overlapping with the env.step() at the bottom.
The two graphs of interest are 12 and 15. The profile records which thread launched each graph, which confirmed that the two threads were launching separate graphs.</p>

<p><img class="wrap" src="/generated/2025-03-03-earl/nsight-two-threads-800-a6b5fc19f.png" alt="Nsight Systems profile of two actor threads" srcset="/generated/2025-03-03-earl/nsight-two-threads-400-e1ef4ba4c.webp 400w, /generated/2025-03-03-earl/nsight-two-threads-600-e1ef4ba4c.webp 600w, /generated/2025-03-03-earl/nsight-two-threads-800-e1ef4ba4c.webp 800w, /generated/2025-03-03-earl/nsight-two-threads-1000-e1ef4ba4c.webp 1000w" /></p>

<h3 id="batching-and-sharding-data">Batching and sharding data</h3>

<p>The paper suggests that experience data is copied from the actors to the learners one batch at a time. This seems quite inefficient. I instead break up acting into cycles of configurable length, and copy one cycle’s worth of batches at a time from the actor to the learners (i.e. num_envs * steps_per_cycle units of observations, actions, rewards, etc).</p>

<p>The paper does not address the details of how the data is stored and retrieved for replay. In Earl, the user specifies num_envs, which for GymnasiumLoop is the number of environments per actor thread. There are two actor threads per actor device. Each actor thread shards the trajectory and actor state evenly across the learner devices. Thus when the framework calls Agent.update_experience() on the learner device, the experience data has batch size = num_envs / len(learner_devices), which must be an integer (i.e. must divide evenly). The Agent is free to store and replay that experience in whatever way it chooses. For my R2D2 implementation, to keep things simple, I store the experience using that same batch size (num_envs / len(learner_devices)) and then replay some batch size that is an integer multiple of that batch size.</p>
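<p>The sharding arithmetic in sketch form (the shapes are illustrative, and this is plain NumPy rather than the actual jax.device_put_sharded call):</p>

```python
import numpy as np

num_envs, steps_per_cycle, num_learner_devices = 8, 16, 2
assert num_envs % num_learner_devices == 0  # must divide evenly

# One cycle's worth of observations for one actor thread
# (84x84 is an assumed observation shape).
obs = np.zeros((steps_per_cycle, num_envs, 84, 84))
shards = np.split(obs, num_learner_devices, axis=1)

# Each learner device sees batch size num_envs // num_learner_devices.
print(len(shards), shards[0].shape)  # 2 (16, 4, 84, 84)
```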

<p>One thing not mentioned in the paper but which is obviously necessary for many algorithms is copying of actor state to the learners. For example, in R2D2 the LSTM hidden states at the beginning of a trajectory are needed by the learners. The framework can take care of properly distributing the observations, actions and rewards, but the details of the actor state depend on the particular agent implementation, so users of GymnasiumLoop have to implement Agent.shard_actor_state(actor_state, learner_devices). Depending on the algorithm, some elements of the state will be sharded evenly to go along with the trajectory data, while other elements will be replicated across all learner devices or not copied at all.</p>

<h3 id="performance-tuning">Performance tuning</h3>

<p>In GymnasiumLoop, ideally all accelerator devices are being fully utilized. Getting there requires a lot of tuning. Some of the knobs available for tuning and what they do:</p>

<ul>
  <li>Num_envs: the number of environments per actor thread (there are 2 actor threads per actor device). Increasing this will increase CPU usage during env.step() and increase actor device (e.g. GPU) usage during agent.act(). It will increase CPU memory usage (for the environment state). It will also increase memory usage on the actor device, more so if the actor maintains per-environment state (e.g. recurrent hidden state).</li>
  <li>Num_off_policy_optims_per_cycle: the number of times Agent.loss and Agent.optimize_from_grads is called between waiting for new experience data from the actors. Increasing this will increase learner device usage. It may cause the actor threads to block (and thus make the actor devices and CPUs idle) if the queue for experience data is full (currently the queue has a max length of 2). Increasing it will also make the algorithm more off-policy, since it does more updates on experience that was produced by older policies.</li>
  <li>The number of actor devices and learner devices. More learner devices effectively increases batch sizes and thus can help training be faster or more stable. More actor devices increases the rate at which new experience trajectories are made available to the learners. If the number of environments on a machine is limited by CPU cores or CPU memory, increasing the number of actor devices effectively reduces the actor batch size (num_envs).</li>
</ul>

<p>The metrics that are currently exposed on every run are the cycle time for the learners (which includes getting new experience and then some number of loss + optimization steps), and the time the learners spend waiting for an actor to enqueue experience. Because JAX arrays are materialized asynchronously, the actor thread’s call to jax.device_put_sharded() will return before the data has actually been copied to the learner devices. Thus the learner device will be able to successfully retrieve experience from the queue, but computation may block waiting for the data to be copied. I don’t think there’s a good way to expose the exact amount of time spent waiting for copies during normal execution (doing so would require putting in barriers that could hurt performance). So the process I used for tuning performance was:</p>

<ol>
  <li>If learner device utilization is not high, try tweaking the above knobs to get it up.</li>
  <li>When that didn’t succeed, use a profiler (I used Nvidia Nsight Systems). This made it fairly easy to see when computation was waiting on copies.</li>
</ol>

<h3 id="performance-footgun-implicit-vs-explicit-host-device-copies">Performance footgun: implicit vs explicit host-&gt;device copies</h3>

<p>Using the profiler I was able to spot a blocking host-&gt;device copy in the inner loop of the actor cycle that was caused by something like:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">steps_per_cycle</span><span class="p">:</span>
  <span class="n">observation</span><span class="p">,</span> <span class="n">done</span><span class="p">,</span> <span class="n">reward</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="nf">step</span><span class="p">(</span><span class="n">action</span><span class="p">)</span>
  <span class="n">observation</span><span class="p">,</span> <span class="n">done</span><span class="p">,</span> <span class="n">reward</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">numpy</span><span class="p">.</span><span class="nf">array</span><span class="p">(</span><span class="n">observation</span><span class="p">),</span> <span class="n">jax</span><span class="p">.</span><span class="n">numpy</span><span class="p">.</span><span class="nf">array</span><span class="p">(</span><span class="n">done</span><span class="p">),</span> <span class="n">jax</span><span class="p">.</span><span class="n">numpy</span><span class="p">.</span><span class="nf">array</span><span class="p">(</span><span class="n">reward</span><span class="p">)</span>
  <span class="n">action</span> <span class="o">=</span> <span class="n">agent</span><span class="p">.</span><span class="nf">act</span><span class="p">(</span><span class="n">observation</span><span class="p">,</span> <span class="n">done</span><span class="p">,</span> <span class="n">reward</span><span class="p">)</span>
</code></pre></div></div>

<p>It turned out that the explicit conversion from Numpy to JAX arrays was much much slower than just passing the Numpy arrays directly into Agent.act. I confirmed the issue with this simplified example:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="n">jax</span>

<span class="nd">@jax.jit</span>
<span class="k">def</span> <span class="nf">add</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
 <span class="k">return</span> <span class="n">a</span><span class="o">+</span><span class="n">b</span><span class="o">+</span><span class="mi">1</span>

<span class="k">def</span> <span class="nf">lazy</span><span class="p">():</span>
 <span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">ones</span><span class="p">((</span><span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">))</span>
 <span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">ones</span><span class="p">((</span><span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">))</span> <span class="o">*</span> <span class="mi">2</span>
 <span class="k">return</span> <span class="nf">add</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">eager</span><span class="p">():</span>
 <span class="n">a</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">numpy</span><span class="p">.</span><span class="nf">array</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nf">ones</span><span class="p">((</span><span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">)))</span>
 <span class="n">b</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">numpy</span><span class="p">.</span><span class="nf">array</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nf">ones</span><span class="p">((</span><span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">))</span> <span class="o">*</span> <span class="mi">2</span><span class="p">)</span>
 <span class="k">return</span> <span class="nf">add</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
</code></pre></div></div>

<p>The lazy function takes 0.3 milliseconds and the eager takes 1.1 (3.7x slowdown) on a Google Colab instance with a T4 GPU.</p>

<p>Under the hood, the eager function launches 3 CUDA kernels, one for each array copy and one for the addition, returning to the Python interpreter between each. The lazy function goes into CUDA only once.</p>

<h3 id="batching-gymnasium-environments">Batching Gymnasium environments</h3>

<p>In the Podracers paper’s section on Sebulba, they write:
<em>To minimise the effect of Python’s GIL, when stepping a batch of environments in parallel, each Python actor-thread interacts with a special batched environment; this is exposed to Python as a single environment that takes a batch of actions and returns a batch of observations; behind the scenes it steps each environment in the batch in parallel using a shared pool of C++ threads.</em></p>

<p>The functionality described in the paper is provided for some environments by <a href="https://envpool.readthedocs.io/en/latest/">EnvPool</a>. For Gymnasium environments not supported by EnvPool, Earl applies Gymnasium’s built-in vectorization, which uses Python multiprocessing to run multiple copies of the environment in parallel. This is much, much slower than EnvPool, and one fun problem was that each subprocess would try to pre-allocate most of the GPU memory on startup (this happens whenever you import JAX). I worked around this by setting an environment variable telling JAX to use only the CPU in those environment subprocesses.</p>
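<p>A minimal sketch of that workaround (the helper name is mine, and Earl’s actual code may differ; the key is setting <code>JAX_PLATFORMS=cpu</code> before the subprocess imports JAX):</p>

```python
import os

def cpu_only_env(base_env=None):
    """Build the environment for an env-stepping subprocess.

    Setting JAX_PLATFORMS=cpu before `import jax` runs keeps the
    subprocess from initializing the GPU backend and pre-allocating
    most of the GPU memory on startup.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["JAX_PLATFORMS"] = "cpu"
    return env

# e.g. subprocess.Popen(worker_cmd, env=cpu_only_env())
```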

<h2 id="potential-improvements">Potential improvements</h2>

<h3 id="pmap---automatic-parallelism">Pmap -&gt; automatic parallelism</h3>

<p>When the Podracers paper was written, jax.pmap was the recommended way of parallelizing computation across multiple devices. Since then, the JAX team has developed “<a href="https://docs.jax.dev/en/latest/notebooks/Distributed_arrays_and_automatic_parallelization.html">automatic parallelism</a>” and encourages its use over pmap. The basic idea is that the programmer shards (or replicates, which in JAX is called a type of sharding) arrays across devices, and the compiler and runtime automatically figure out where computation should happen and where function outputs should go.</p>

<p>I prototyped an implementation of GymnaxLoop that used automatic parallelism before throwing it away and settling on the explicit pmap approach. The reason is that I couldn’t convince myself that sampling randomly from a replay buffer wouldn’t result in extra cross-device copies and uneven workloads. Earl is currently entirely agnostic to how an agent manages its experience state (which will include the replay buffers). Experience replay could be implemented in a way that is compatible with automatic parallelism (I believe the main constraints are that the buffer has to be sized such that it can be sharded evenly across devices, and that reads and writes are balanced across all devices), but guaranteeing this would require the framework to be more opinionated about how replay buffers are managed.</p>

<p>If I were to do this, I would look to <a href="https://dm-acme.readthedocs.io/en/latest/">DeepMind’s Acme</a> for inspiration. It is extremely prescriptive about how experience state is managed, and I think a similar design could result in something that’s guaranteed to be performant with JAX’s automatic parallelism.</p>

<h3 id="multiple-losses">Multiple losses</h3>

<p>Some algorithms compute different loss terms for different subsystems. Earl doesn’t currently support this, but it wouldn’t be too hard to add.</p>

<h2 id="scaling-to-multiple-machines-or-how-special-are-the-pods-really">Scaling to multiple machines, or how special are the “pods” really?</h2>

<p>Earl currently only supports single-machine training. Supporting multi-machine would be as straightforward as adding a call to jax.distributed.initialize() in the training script. However, when scaling to multiple machines, network bandwidth becomes a critical factor. Let’s analyze how bandwidth affects training throughput and compare TPU pods with modern GPU clusters.</p>

<p>The “pod” in the “Podracers” article is a reference to a Google Cloud TPU pod, a group of TPU chips with high-bandwidth interconnects. Both Anakin and Sebulba have to send gradients between all learner devices before every optimizer step, and this latency cannot easily be hidden by overlapping it with other work (unlike the transfers from actors to learners, which can be overlapped with both acting and learning). The amount of data that needs to be averaged is: (bits per gradient) x (num parameters).</p>

<p>Let’s say each device has bandwidth of R bits / sec and the gradients take S bits. Assuming the mean is calculated and sent back using a reduce-scatter and then all-gather, the time taken is:</p>

<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mfrac><mrow><mn>2</mn><mi>S</mi></mrow><mi>R</mi></mfrac></mrow><annotation encoding="application/x-tex">\frac{2S}{R}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:2.0463em;vertical-align:-0.686em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.3603em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.00773em;">R</span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">2</span><span class="mord mathnormal" style="margin-right:0.05764em;">S</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.686em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span></span>

<p>Now let’s try to get some sensible values of R and S.
Most online RL research uses relatively few parameters compared to modern LLMs (e.g. Dreamer v3 XL has 300 million parameters, the unusually large Gato has 1.2 billion).
To work an example, let’s say we use 16 bits per gradient x 1 billion parameters = 16 Gbits. The TPU v6e has R = 3584 Gbps of inter-chip interconnect bandwidth, which gets us:</p>

<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mfrac><mrow><mn>2</mn><mo>×</mo><mn>16</mn></mrow><mn>3584</mn></mfrac><mo>=</mo><mn>0.009</mn><mi>s</mi></mrow><annotation encoding="application/x-tex">\frac{2 \times 16}{3584} = 0.009s</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:2.0074em;vertical-align:-0.686em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.3214em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">3584</span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">2</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord">16</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.686em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">0.009</span><span class="mord mathnormal">s</span></span></span></span></span>

<p>To answer how special TPU pods are, let’s compare this to Nvidia GPUs. Nvidia’s GB200 can connect up to 72 GPUs at 1800 Gbps. The same reduction would take</p>

<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mfrac><mrow><mn>2</mn><mo>×</mo><mn>16</mn></mrow><mn>1800</mn></mfrac><mo>=</mo><mn>0.018</mn><mi>s</mi></mrow><annotation encoding="application/x-tex">\frac{2 \times 16}{1800} = 0.018s</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:2.0074em;vertical-align:-0.686em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.3214em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1800</span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">2</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord">16</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.686em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">0.018</span><span class="mord mathnormal">s</span></span></span></span></span>

<p>Roughly twice as long, but to determine how much of an impact this makes on training throughput we’d need to look at a particular example, which depends heavily on hyperparameters. TPU networking still appears to have higher bandwidth, but for workloads that fit within an NVLink switch, the impact on training throughput may be quite small.</p>
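<p>The arithmetic above is easy to play with. A tiny calculator using the same formula and the bandwidth figures quoted above:</p>

```python
def allreduce_seconds(grad_gbits, link_gbps):
    """Time for a reduce-scatter followed by an all-gather: 2S/R."""
    return 2 * grad_gbits / link_gbps

GRAD_GBITS = 16  # 16-bit gradients x 1 billion parameters

tpu_v6e = allreduce_seconds(GRAD_GBITS, 3584)  # ~0.009 s
gb200 = allreduce_seconds(GRAD_GBITS, 1800)    # ~0.018 s
```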

<h2 id="conclusion">Conclusion</h2>

<p>Years ago only DeepMind and OpenAI could do distributed RL at scale. Today, thanks to the libraries, APIs, on-demand cloud computing, and knowledge that is available, it’s within reach of a very small team (like me!).</p>

<h2 id="acknowledgements">Acknowledgements</h2>

<p>I started Earl while working at the Astera Institute, though I didn’t implement distributed training until after I left.
I thank Jed McCaleb for agreeing to let me open-source it.
My coworkers at Astera contributed to Earl early on: Andrew Grebenisan, Mick van Gelderen and Eric Alt.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>I use these terms for consistency with the paper, which in no way should be read as my endorsement of The Phantom Menace. Though I did enjoy the <a href="https://en.wikipedia.org/wiki/Star_Wars_Episode_I:_Racer">Racer</a> game. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>garymm</name></author><category term="programming" /><category term="machine learning" /><summary type="html"><![CDATA[In this post I will briefly describe Earl, a reinforcement learning (RL) framework I wrote that enables scalable distributed training across multiple devices, and discuss some of the things I learned along the way.]]></summary></entry><entry><title type="html">Starflate: Deflate decompression in C++23</title><link href="http://garymm.org/blog/2025/01/31/starflate/" rel="alternate" type="text/html" title="Starflate: Deflate decompression in C++23" /><published>2025-01-31T00:00:00-08:00</published><updated>2025-01-31T00:00:00-08:00</updated><id>http://garymm.org/blog/2025/01/31/starflate</id><content type="html" xml:base="http://garymm.org/blog/2025/01/31/starflate/"><![CDATA[<p>In this post I describe some things I learned while working on <a href="https://github.com/garymm/starflate">Starflate</a>, an implementation of Deflate decompression in C++23 that I wrote with my friend <a href="https://github.com/oliverlee">Oliver Lee</a>.</p>

<p>Deflate is a compression codec used in GZip, Zip, PNG and other formats. I wanted to get hands-on with GPU programming and decided implementing Deflate decompression would be a fun way to do that. After finishing the CPU-only implementation, I realized there is no way to efficiently parallelize it, so I’ll have to find another project for learning GPU programming. But along the way I did learn quite a bit about compression and C++.</p>

<h2 id="deflate-decompression">Deflate decompression</h2>

<p>I think this diagram does a pretty good job of showing the different layers in the Deflate compression algorithm:</p>

<p><img class="wrap" src="/generated/2025-01-31-deflate-800-0c3ca2880.png" alt="deflate compression layers" srcset="/generated/2025-01-31-deflate-400-635bef326.webp 400w, /generated/2025-01-31-deflate-600-635bef326.webp 600w, /generated/2025-01-31-deflate-800-635bef326.webp 800w, /generated/2025-01-31-deflate-1000-635bef326.webp 1000w" /></p>

<p>Figure 4 from <a href="https://doi.org/10.1002/cpe.7454">Takafuji et al., 2022</a></p>

<p>The innermost layer is LZSS, in which the input is a series of either:</p>

<ul>
  <li>A “literal”, meaning just copy this byte to the output, or</li>
  <li>A length and backwards-distance pair (l, d),  meaning copy l bytes starting from output[-d] to output.</li>
</ul>
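<p>A runnable Python sketch of this inner layer (illustrative only, not Starflate’s API). Note that a back-reference may overlap its own output, which is why the copy proceeds byte-at-a-time:</p>

```python
def lzss_decode(tokens):
    """Decode a sequence of LZSS tokens.

    Each token is either a literal byte (an int), or a
    (length, distance) pair meaning "copy `length` bytes starting
    `distance` bytes back in the output".
    """
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, int):      # literal: copy the byte as-is
            out.append(tok)
        else:                         # (length, distance) back-reference
            length, dist = tok
            for _ in range(length):   # byte-at-a-time: copies may overlap
                out.append(out[-dist])
    return bytes(out)

# "abcabcabc" as three literals plus one overlapping back-reference:
lzss_decode(list(b"abc") + [(6, 3)])  # b"abcabcabc"
```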

<p>The second layer is an encoding scheme for the length and distance pairs that doesn’t seem to have a name, but is shown in the diagram as “deflate” format. The deflate standard defines a code table for distances and another for lengths. The steps to decode go something like:</p>

<ol>
  <li>Look up the code in the table. This gives a base value and a number of extra bits to read from the input.</li>
  <li>Read those extra bits from the input, interpret them as an integer.</li>
  <li>Add the integer to the base value.</li>
</ol>
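<p>For example, with a small fragment of the deflate length table (base values and extra-bit counts as given in RFC 1951, section 3.2.5; <code>read_bits</code> stands in for pulling bits off the input stream):</p>

```python
# code -> (base_length, number_of_extra_bits), per RFC 1951 sec. 3.2.5
LENGTH_CODES = {265: (11, 1), 266: (13, 1), 269: (19, 2), 270: (23, 2)}

def decode_length(code, read_bits):
    base, n_extra = LENGTH_CODES[code]
    return base + read_bits(n_extra)  # extra bits, interpreted as an integer

decode_length(266, lambda n: 1)  # base 13 + offset 1 = length 14
```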

<p>The outermost layer is Huffman coding. I won’t do a better job than <a href="https://en.wikipedia.org/wiki/Huffman_coding">Wikipedia</a> of explaining it, but basically it’s a provably optimal (as in maximally compact) prefix-free coding scheme (meaning the code for any symbol is not a prefix of the code for any other symbol).</p>

<p>Finally, there is the added complexity that the Huffman code tables themselves can be included in the compressed data, and they are encoded using a scheme similar to the second-layer (“deflate” coding) scheme (but slightly different).</p>

<h2 id="starflate-design">Starflate design</h2>

<p>The <a href="https://github.com/garymm/starflate/blob/289b78afa5aa93f0971fcee9f5d17d3bf0a93dd2/src/decompress.cpp">core implementation of decompression</a> is 391 lines of code (excluding comments and blank lines), and I think it’s relatively readable. However, there are another ~1300 lines of code in helper libraries we wrote for dealing with bit streams and Huffman coding. These helper libraries allowed the main code to stay quite short and readable.</p>

<h3 id="bit_span">bit_span</h3>

<p><a href="https://github.com/garymm/starflate/blob/289b78afa5aa93f0971fcee9f5d17d3bf0a93dd2/huffman/src/bit_span.hpp">bit_span</a> is like std::span in that it is a non-owning view of a contiguous extent of the same type of data. Unlike span, bit_span allows its users to iterate over individual bits, even though the underlying data is stored as bytes.</p>
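<p>In Python terms, it behaves something like this generator (a sketch, not the C++ interface; deflate packs bits least-significant-bit-first within each byte):</p>

```python
def iter_bits(data):
    """Yield the bits of `data` one at a time, LSB-first within each byte."""
    for byte in data:
        for i in range(8):
            yield (byte >> i) & 1

list(iter_bits(b"\x05"))  # [1, 0, 1, 0, 0, 0, 0, 0]
```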

<h3 id="huffmantable">huffman::table</h3>

<p><a href="https://github.com/garymm/starflate/blob/289b78afa5aa93f0971fcee9f5d17d3bf0a93dd2/huffman/src/table.hpp">huffman::table</a> is a Huffman code table. For ease of testing it has a bunch of different constructors, but the only one used in decompression is the one that takes a range of pairs of (symbol range, bitsize). Huffman coding uses prefix-free codes, meaning that we decode the input one bit at a time and we’re done as soon as we find the bit pattern in the table. Internally the table stores things sorted lexicographically, which allows for efficient decoding by keeping track of where we are in the table in between attempts to decode the bits. In an attempt to be idiomatic C++, the table exposes iterators with the standard begin() and end() methods. The main use of the table class is in <a href="https://github.com/garymm/starflate/blob/289b78afa5aa93f0971fcee9f5d17d3bf0a93dd2/huffman/src/decode.hpp#L74">decode_one</a>, the pseudo-code for which is:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">decode_one</span><span class="p">(</span><span class="n">huffman_table</span><span class="p">,</span> <span class="n">bits</span><span class="p">):</span>
   <span class="n">table_pos</span> <span class="o">=</span> <span class="n">huffman_table</span><span class="p">.</span><span class="nf">begin</span><span class="p">()</span> <span class="c1"># iterator
</span>   <span class="n">current_code</span> <span class="o">=</span> <span class="mi">0</span>
   <span class="k">for</span> <span class="n">bit</span> <span class="ow">in</span> <span class="n">bits</span><span class="p">:</span>
      <span class="n">current_code</span> <span class="o">=</span> <span class="p">(</span><span class="n">current_code</span> <span class="o">&lt;&lt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">|</span> <span class="n">bit</span>
      <span class="n">std</span><span class="p">::</span><span class="n">expected</span><span class="o">&lt;</span><span class="n">table</span><span class="p">::</span><span class="n">iterator</span><span class="p">,</span> <span class="n">table</span><span class="p">::</span><span class="n">iterator</span><span class="o">&gt;</span> <span class="n">found</span> <span class="o">=</span> <span class="n">huffman_table</span><span class="p">.</span><span class="nf">find</span><span class="p">(</span>
        <span class="n">current_code</span><span class="p">,</span> <span class="n">table_pos</span><span class="p">)</span>
      <span class="k">if</span> <span class="n">found</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">found</span><span class="o">-&gt;</span><span class="n">symbol</span>
      <span class="n">table_pos</span> <span class="o">=</span> <span class="n">found</span><span class="p">.</span><span class="nf">error</span><span class="p">()</span> <span class="c1"># uses expected::error to hold the next iterator position
</span>      <span class="k">if</span> <span class="n">table_pos</span> <span class="o">==</span> <span class="n">huffman_table</span><span class="p">.</span><span class="nf">end</span><span class="p">():</span>
        <span class="k">return</span> <span class="nf">error</span><span class="p">()</span>
  <span class="k">return</span> <span class="nf">error</span><span class="p">()</span>
</code></pre></div></div>
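<p>A runnable Python version of the same loop, using a toy prefix-free table in place of <code>huffman::table</code> (the real table is sorted and searched incrementally via iterators, but the bit-by-bit prefix matching is the same):</p>

```python
def decode_one_py(table, bits):
    """Decode one symbol: accumulate bits until they match a code.

    `table` maps (code, bitsize) -> symbol. Because the code is
    prefix-free, the first match is unambiguous.
    """
    code, n = 0, 0
    for bit in bits:
        code = (code << 1) | bit
        n += 1
        if (code, n) in table:
            return table[(code, n)]
    raise ValueError("input ended mid-code")

# Toy table: 0 -> 'a', 10 -> 'b', 11 -> 'c'
toy = {(0b0, 1): "a", (0b10, 2): "b", (0b11, 2): "c"}
decode_one_py(toy, [1, 0])  # 'b'
```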

<h2 id="c-features--patterns">C++ features / patterns</h2>

<p>A few C++ features or patterns I learned along the way. Thanks to Oliver for teaching me all these (and more that didn’t stick!).</p>

<h3 id="constexpr">constexpr</h3>

<p>My biggest C++ lesson learned is that a large fraction of the language and library features are compatible with constexpr, meaning they can be evaluated at compile time. While there are potential runtime performance benefits to this, it’s also cool that the compiler must reject undefined behavior in constant expressions, so constant-evaluated code is guaranteed to contain none. That is, this is a way to convert potential runtime errors into compile-time errors. This is the one feature of C++ that I actually missed when writing Rust recently.</p>

<h3 id="stdexpected">std::expected</h3>

<p>Added in C++23, std::expected contains either an expected or an error value. It’s a sane way of propagating errors and we used it extensively. This is one of those things that I didn’t notice was missing from the language when I worked at Google because Google had its own version. Actually the standard library version is better because both the expected and error types can be templated, which we took advantage of for huffman::table::find’s return type.</p>

<h3 id="the-overload-pattern-pattern-matching">The overload pattern: pattern matching</h3>

<p>You can combine std::variant, std::visit, and the overload pattern to get something like Rust’s pattern matching.
We used this <a href="https://github.com/garymm/starflate/blob/289b78afa5aa93f0971fcee9f5d17d3bf0a93dd2/src/decompress.cpp#L227">here</a> to dispatch to different code paths depending on whether we decoded a literal byte to be copied to the output, or a length of previous output to be copied. The syntax for it is terrible, though. This example from <a href="https://www.cppstories.com/2019/02/2lines3featuresoverload.html/">C++ Stories</a> is a good one:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span><span class="o">&lt;</span><span class="k">class</span><span class="o">...</span> <span class="nc">Ts</span><span class="p">&gt;</span> <span class="k">struct</span> <span class="nc">overload</span> <span class="o">:</span> <span class="n">Ts</span><span class="p">...</span> <span class="p">{</span> <span class="k">using</span> <span class="n">Ts</span><span class="o">::</span><span class="k">operator</span><span class="p">()...;</span> <span class="p">};</span>

<span class="n">std</span><span class="o">::</span><span class="n">variant</span><span class="o">&lt;</span><span class="kt">int</span><span class="p">,</span> <span class="kt">float</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&gt;</span> <span class="n">intFloatString</span> <span class="p">{</span> <span class="s">"Hello"</span> <span class="p">};</span>
<span class="n">std</span><span class="o">::</span><span class="n">visit</span><span class="p">(</span><span class="n">overload</span>  <span class="p">{</span>
      <span class="p">[](</span><span class="k">const</span> <span class="kt">int</span><span class="o">&amp;</span> <span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"int: "</span> <span class="o">&lt;&lt;</span> <span class="n">i</span><span class="p">;</span> <span class="p">},</span>
      <span class="p">[](</span><span class="k">const</span> <span class="kt">float</span><span class="o">&amp;</span> <span class="n">f</span><span class="p">)</span> <span class="p">{</span> <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"float: "</span> <span class="o">&lt;&lt;</span> <span class="n">f</span><span class="p">;</span> <span class="p">},</span>
      <span class="p">[](</span><span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&amp;</span> <span class="n">s</span><span class="p">)</span> <span class="p">{</span> <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"string: "</span> <span class="o">&lt;&lt;</span> <span class="n">s</span><span class="p">;</span> <span class="p">}</span>
    <span class="p">},</span>
    <span class="n">intFloatString</span>
<span class="p">);</span>
</code></pre></div></div>

<h3 id="template-deduction-guide">Template deduction guide</h3>

<p>A template deduction guide is some code that one can add to a templated function or class that tells the compiler how to fill in template arguments. This can make using the templated function or class much more readable.</p>

<p>This example from <a href="https://en.cppreference.com/w/cpp/language/class_template_argument_deduction#User-defined_deduction_guides">cppreference</a> is a good one:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// declaration of the template</span>
<span class="k">template</span><span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="k">struct</span> <span class="nc">container</span>
<span class="p">{</span>
    <span class="n">container</span><span class="p">(</span><span class="n">T</span> <span class="n">t</span><span class="p">)</span> <span class="p">{}</span>

    <span class="k">template</span><span class="o">&lt;</span><span class="k">class</span> <span class="nc">Iter</span><span class="p">&gt;</span>
    <span class="n">container</span><span class="p">(</span><span class="n">Iter</span> <span class="n">beg</span><span class="p">,</span> <span class="n">Iter</span> <span class="n">end</span><span class="p">);</span>
<span class="p">};</span>

<span class="c1">// additional deduction guide</span>
<span class="k">template</span><span class="o">&lt;</span><span class="k">class</span> <span class="nc">Iter</span><span class="p">&gt;</span>
<span class="n">container</span><span class="p">(</span><span class="n">Iter</span> <span class="n">b</span><span class="p">,</span> <span class="n">Iter</span> <span class="n">e</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">container</span><span class="o">&lt;</span><span class="k">typename</span> <span class="n">std</span><span class="o">::</span><span class="n">iterator_traits</span><span class="o">&lt;</span><span class="n">Iter</span><span class="o">&gt;::</span><span class="n">value_type</span><span class="o">&gt;</span><span class="p">;</span>

<span class="c1">// uses</span>
<span class="n">container</span> <span class="nf">c</span><span class="p">(</span><span class="mi">7</span><span class="p">);</span> <span class="c1">// OK: deduces T=int using an implicitly-generated guide</span>
<span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">double</span><span class="o">&gt;</span> <span class="n">v</span> <span class="o">=</span> <span class="p">{</span><span class="cm">/* ... */</span><span class="p">};</span>
<span class="k">auto</span> <span class="n">d</span> <span class="o">=</span> <span class="n">container</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span> <span class="n">v</span><span class="p">.</span><span class="n">end</span><span class="p">());</span> <span class="c1">// OK: deduces T=double</span>
</code></pre></div></div>

<h3 id="missing-feature-iterator_interface">Missing feature: iterator_interface</h3>

<p>A lot of the standard library exposes and operates on iterators, so it’s nice to also do so when writing custom data structures. Quoting the <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2727r4.html">proposal for adding std::iterator_interface</a>: “Writing STL iterators is surprisingly hard. There are a lot of things that can subtly go wrong. It is also very tedious, which of course makes it error-prone.” We needed a couple of iterators (one for bit_span and one for huffman::table), so we added a simple version of the <a href="https://github.com/garymm/starflate/blob/289b78afa5aa93f0971fcee9f5d17d3bf0a93dd2/huffman/src/detail/iterator_interface.hpp">iterator_interface</a>. There is an implementation in <a href="https://www.boost.org/doc/libs/1_87_0/doc/html/boost_stlinterfaces/tutorial___iterator_interface_.html">Boost</a>, but we wanted to avoid any external dependencies.</p>

<h2 id="setting-up-c-is-horrible">Setting up C++ is horrible</h2>

<p>C++ takes an insane amount of set-up to get a repository with features that are extremely easy to get in other languages. We set up the following, and none of it is really standard or easy to do:</p>

<ul>
  <li>Build system. One has to choose between Make, CMake, Meson, Bazel, etc. We chose Bazel because it’s good and we’re used to it, but it takes a lot of work to set up, it’s poorly documented, and less common combinations of features (like C++ test coverage with Clang) have been broken.</li>
  <li>Hermetic toolchain, meaning inside the repo we define what versions of Clang, GCC, etc we want to use, rather than relying on whatever is installed on the system.</li>
  <li>Sanitizers. E.g. thread sanitizer, address sanitizer, undefined behavior sanitizer. These are compilation modes that instrument the code and fail if the code does something bad. Address sanitizer and undefined behavior sanitizer aren’t needed for most other languages, but I think it’s pretty insane to write C++ without them.</li>
  <li>Static analysis (AKA linting). Basically turn on all the compiler warnings and treat them as errors (pretty insane that this is not the default for most of the warnings), and also run clang-tidy. Running clang-tidy through Bazel is not straightforward, but Oliver figured it out.</li>
  <li>Autoformatting. Again insane that this is not the default, and one needs to do extra work to get it configured in editors and enforced in CI.</li>
  <li>Bringing in third party dependencies is horrible, and you need at least some third party dependencies because the standard library doesn’t include a unit test library.</li>
</ul>]]></content><author><name>garymm</name></author><category term="programming" /><category term="cpp" /><summary type="html"><![CDATA[In this post I describe some things I learned while working on Starflate, an implementation of Deflate decompression in C++23 that I wrote with my friend Oliver Lee.]]></summary></entry><entry><title type="html">Assembling an infrastructure for machine learning research</title><link href="http://garymm.org/blog/2025/01/27/assembling-ml-exp-infra/" rel="alternate" type="text/html" title="Assembling an infrastructure for machine learning research" /><published>2025-01-27T00:00:00-08:00</published><updated>2025-01-27T00:00:00-08:00</updated><id>http://garymm.org/blog/2025/01/27/assembling-ml-exp-infra</id><content type="html" xml:base="http://garymm.org/blog/2025/01/27/assembling-ml-exp-infra/"><![CDATA[<p>While working on machine learning research at the Astera Institute<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>, I led a team that assembled a system that enabled researchers to quickly and easily run experiments that used up to a full datacenter’s worth of GPUs. I intentionally wrote “assemble” rather than “build”, because the system mostly consists of off-the-shelf components. The challenge was in digging through the huge number of options for each possible piece of functionality, selecting appropriately, gluing things together into a working system, and designing an easy but powerful interface. I’m proud of how little code we wrote relative to how much functionality the system provides.</p>

<p>Let’s start with the core functionality that the system provided:</p>

<ul>
  <li>Takes code from the user’s git branch.</li>
  <li>Runs the code on a cluster with a specifiable number of trials running in parallel, and with each trial using a specifiable number of GPUs.</li>
  <li>Executes a hyperparameter search using a specifiable search space and algorithm, or multiple trials with different random seeds to get a distribution of results for a fixed set of hyperparameters.</li>
  <li>Persists the results and makes them viewable in a web UI, or via CSV files.</li>
</ul>

<p>Now let’s go into how this is all implemented!</p>

<h2 id="the-components">The components</h2>

<h3 id="hardware">Hardware</h3>

<p>We rented hardware from Voltage Park, who at the time offered exactly one thing: bare-metal servers running Ubuntu with 8 Nvidia GPUs. Because of pre-existing contracts, we didn’t have any choice in hardware or cloud providers. This constrained the design space for some of the other parts of the system.</p>

<h3 id="kubernetes-on-talos-hardware-abstraction-and-workload-orchestration">Kubernetes on Talos: Hardware abstraction and workload orchestration</h3>

<p>Just as an operating system abstracts over CPUs and RAM on a computer and manages the life cycles of processes, Kubernetes abstracts over all the hardware in a cluster and manages the life cycles of containers.</p>

<p>We ended up using <a href="https://www.talos.dev">Talos</a>, a Linux distribution that includes Kubernetes. Overall we were really happy with that choice. It’s well-designed, well-documented, and well-supported.</p>

<h4 id="the-journey">The journey</h4>

<p>While there are alternatives, Kubernetes is by far the most popular system in this category and that brings with it a huge ecosystem of tools, services, patterns and documentation, so for me it was an easy choice.</p>

<p>The difficult thing was figuring out how to run it. Big cloud providers like Azure or CoreWeave provide a managed Kubernetes service. Because we were tied to Voltage Park, managed Kubernetes services weren’t an option. There are many ways to run Kubernetes on your own. I initially picked Kubespray because it was mentioned in the official Kubernetes documentation and it was built on Ansible, which we were already using. While I did successfully run a cluster using Kubespray, I was not satisfied:</p>

<ul>
  <li>Creating or modifying a cluster is very slow: around 30 minutes to apply a configuration change to a 6-node cluster.</li>
  <li>Routine operations would often fail and leave the cluster in an unknown and probably invalid state that was very difficult to recover from. Because Kubespray makes changes to the underlying node OS but doesn’t take full responsibility for it the way Talos does, getting back to a good state required reinstalling the OS and re-running Kubespray, which took over an hour.</li>
  <li>We had mysterious and hard-to-debug issues with the GPUs becoming inaccessible, which we worked around by rebooting the nodes.</li>
  <li>There is no paid support option and I couldn’t resolve the above issues using the free community-provided support (including the patchy documentation).</li>
</ul>

<p>In contrast, Talos:</p>

<ul>
  <li>Takes much less time to apply configuration (sorry I don’t remember the timing, but the fact that I don’t remember means it was not a big deal!).</li>
  <li>Installs the operating system in a pre-configured state such that it is ready to be part of the Kubernetes cluster, and the OS is immutable (read-only after installation), so it is much less likely to end up in weird states.</li>
  <li>Had no such flaky GPU issues.</li>
  <li>Has paid support options and excellent response times to community-reported bugs.</li>
</ul>

<p>The main challenge we had with Talos is that our cloud provider did not give us a way to install a custom OS. After first trying to run it inside a VM inside the Ubuntu host, we ended up finding a way to overwrite Ubuntu with Talos from within Ubuntu! This meant we could run Talos on bare metal.</p>

<h3 id="distribution-registry-container-image-hosting">Distribution Registry: Container image hosting</h3>

<p>Container images are the unit of distribution for code that runs on Kubernetes. A container registry is a service that stores and allows clients to upload and download container images. There are many options for cloud-hosted managed container registries, but we wanted our images to be stored on the same local network as our Kubernetes nodes in order to maximize bandwidth when downloading (AKA “pulling”) images. So we ran <a href="https://distribution.github.io/distribution/about/">Distribution Registry</a> inside our cluster.</p>

<h4 id="the-journey-1">The journey</h4>

<p>The main difficulty with self-hosting the registry was configuring network access. Our requirements:</p>

<ul>
  <li>The registry is accessible via HTTPS to pods inside the cluster. This is needed because Katib (discussed below) <a href="https://github.com/kubeflow/katib/blob/5723604d419c5ba5bf01240b7be5ebf55aaee0bc/pkg/webhook/v1beta1/pod/utils.go#L63">fetches image metadata directly from the registry</a> and there is no easy way to tell it to connect without HTTPS.</li>
  <li>Image pushes and pulls do not go through Tailscale, since that reduces bandwidth and our images are pretty large.</li>
</ul>

<p>We ended up with the following solution:</p>

<ul>
  <li>A Tailscale name is used for the registry. We configured Tailscale to automatically generate an SSL certificate so connections over HTTPS work.</li>
  <li>We configured our cluster’s DNS to forward requests for .ts.net domains to an in-cluster Tailscale DNS. So connections from inside a pod inside the cluster also go through Tailscale.</li>
  <li>We configured our containerd, which is responsible for pulling images when starting containers, to treat the registry’s Tailscale domain name as an alias of the registry’s in-cluster .svc.cluster.local name, thus bypassing Tailscale encryption and maintaining fast image pulls.</li>
  <li>We configured Kaniko (discussed below) to push to the registry through its .svc.cluster.local name, thus bypassing Tailscale and maintaining fast image pushes.</li>
</ul>

<h3 id="kaniko-in-cluster-image-building">Kaniko: In-cluster image building</h3>

<p><a href="https://github.com/GoogleContainerTools/kaniko">Kaniko</a> takes in a git repository URL and revision and a path to a Dockerfile within the repository, and it builds an image according to the Dockerfile and pushes it to our Distribution Registry. This is how a user’s code gets into a container image.</p>

<h4 id="the-journey-2">The journey</h4>

<p>We started by building images locally on the user’s computer and then pushing them from there to the cluster. This worked, but due to some large dependencies (e.g. PyTorch alone is over 900 MB), any push of the image layer that contained the dependencies was very slow. Since the actual code being modified (i.e. the git repo) was much smaller, it made sense to upload only that from the user’s computer, build the image in the cluster, and let the push to the registry happen over a fast local connection. This does require users to commit and push their code to git before starting an experiment, but that is a good practice anyway.</p>

<h3 id="katib-multi-trial-experiment-orchestration">Katib: Multi-trial experiment orchestration</h3>

<p><a href="https://www.kubeflow.org/docs/components/katib/">Katib</a> is a system for running distributed hyperparameter search on Kubernetes. A “search” over different random seeds can be used as a way to get a distribution of results for a fixed set of hyperparameters. Katib is very flexible but that flexibility means it requires a lot of configuration for each experiment. We were able to simplify the user experience dramatically through a mix of automation and convention. The main things that Katib needs to know are:</p>

<ul>
  <li>The search space. We require the user to write this in a YAML file.</li>
  <li>The metrics to optimize. We require the user to write this in the same YAML file.</li>
  <li>How to run an individual trial. E.g., container image to use, how many GPUs it needs. The Launch tool (described below) handles this automatically.</li>
  <li>How to pass in hyperparameter values for a trial. The Launch tool handles this automatically, by assuming the user’s code follows the convention of using Draccus (described below) or something compatible for command line parsing.</li>
  <li>How to extract metrics from a trial. The Launch tool handles this automatically, by assuming the user’s code follows the convention of writing its metrics in Tensorboard format to the path specified in the <code class="language-plaintext highlighter-rouge">--tensorboard_dir</code> command line arg.</li>
</ul>
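<p>As a rough sketch, the per-experiment YAML boils down to something like the following, shown here as the equivalent Python dict with a cheap validity check. The field names are illustrative, not our actual schema:</p>

```python
# Hypothetical shape of the search space + metrics config the user writes
# in YAML. Field names are illustrative, not the actual schema we used.
experiment_config = {
    "search_space": {
        "lr": {"type": "double", "min": 1e-4, "max": 1e-1},
        "batch_size": {"type": "int", "min": 32, "max": 256},
    },
    "metrics": {"objective": "validation_loss", "goal": "minimize"},
}

def validate(config):
    """Cheap checks that catch obvious mistakes before touching the cluster."""
    assert config["metrics"]["goal"] in ("minimize", "maximize")
    for name, spec in config["search_space"].items():
        assert spec["min"] < spec["max"], f"empty range for {name}"

validate(experiment_config)
```

One of the advantages over Ray Tune mentioned below is exactly this: a static config file is much easier to validate up front than arbitrary Python.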

<h4 id="the-journey-3">The journey</h4>

<p>Before Katib, we tried Ray Tune. The things we liked less about Ray Tune than Katib:</p>

<ul>
  <li>The APIs and documentation are a mess. In contrast, while Katib’s documentation is very incomplete, the repo contains lots of examples that are pretty instructive, and the APIs are much more intuitive.</li>
  <li>Ray Tune requires writing an imperative Python file using the aforementioned confusing APIs for every search. It’s much easier to check the validity of a static YAML file that configures Katib than to check for all the ways Python code might be wrong.</li>
  <li>Ray Tune seemed to require more restructuring of the researcher’s code.</li>
  <li>The only way to track progress is via terminal output (whereas Katib has a nice web UI), and even totally correct use of Ray Tune results in massive amounts of warnings and useless messages being printed.</li>
</ul>

<h3 id="draccus-training-code-configuration-specification">Draccus: Training code configuration specification</h3>

<p><a href="https://github.com/dlwh/draccus">Draccus</a> is a simple Python library for defining and parsing configuration using dataclasses. The key thing that our system requires of the training code used for a trial is that a hyperparameter named “foo” is accepted and parsed via the command line flag <code class="language-plaintext highlighter-rouge">--foo</code>. This lets the Launch tool translate mechanically between the search space the user wrote in YAML and the command line for a trial.</p>
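<p>To make the convention concrete, here is a minimal stand-in built on argparse (not Draccus’s actual API): each dataclass field named “foo” becomes a <code class="language-plaintext highlighter-rouge">--foo</code> flag.</p>

```python
import argparse
from dataclasses import dataclass, fields

# Stand-in for the Draccus convention: a config field named "foo" is
# settable via --foo. Draccus's real API differs; this only illustrates
# the contract that the Launch tool relies on.
@dataclass
class TrainConfig:
    lr: float = 1e-3
    batch_size: int = 64
    tensorboard_dir: str = "/tmp/tb"

def parse_config(argv):
    parser = argparse.ArgumentParser()
    for f in fields(TrainConfig):
        # Each dataclass field becomes a command line flag of the same name.
        parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return TrainConfig(**vars(parser.parse_args(argv)))

cfg = parse_config(["--lr", "0.01", "--tensorboard_dir", "/data/tb"])
# cfg.lr == 0.01, cfg.batch_size == 64 (default), cfg.tensorboard_dir == "/data/tb"
```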

<h4 id="the-journey-4">The journey</h4>

<p>The main alternative I considered was Hydra. Hydra seems to have a superset of the functionality in Draccus but the added complexity of all of the options didn’t seem worth the benefits. Due to the modular system design, it would be easy to switch later if the team decides Hydra is needed.</p>

<h3 id="mlflow-experiment-metric-tracking">MLflow: Experiment metric tracking</h3>

<p>While Katib tracks the metrics being optimized in a search, there are many other metrics that can be useful to analyze, and having a UI to visualize metrics throughout a trial and compare them across experiments is really useful. Storing artifacts like videos of an agent interacting with an RL environment is also key for understanding training progress. For this we used <a href="https://mlflow.org/docs/latest/tracking.html">MLflow Tracking</a>, a service to track metrics and store artifacts.</p>

<h4 id="the-journey-5">The journey</h4>

<p>The main alternative I considered was Weights &amp; Biases. While they have very similar sets of features, we ended up choosing MLflow because:</p>

<ul>
  <li>It has documented HTTP APIs, meaning one can interact with it from any language. I didn’t want to be forced to use Python for all the tooling that might want to interact with our experiment metrics.</li>
  <li>It can be much cheaper. Databricks doesn’t make this clear, but if you provide your own storage (e.g. S3 bucket), they will host MLflow tracking for free. Or you can self-host for free.</li>
</ul>

<p>Once we had settled on MLflow, the main challenge was finding and enforcing a convention on how to organize experiments and runs so people can find what they need. While I’m not confident this is the best solution, we ended up writing a small wrapper over the MLflow Python client that sets the experiment name and run name to match the Katib experiment and trial name. It gets this Katib metadata from environment variables set by the Launch tool. This at least makes it easy to go between the two systems.</p>
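<p>The gist of that wrapper, sketched below. The environment variable names here are assumptions (Launch sets the real ones); the returned names would then be passed to the MLflow client, e.g. via <code class="language-plaintext highlighter-rouge">mlflow.set_experiment(...)</code> and <code class="language-plaintext highlighter-rouge">mlflow.start_run(run_name=...)</code>.</p>

```python
import os

# Sketch of the Katib -> MLflow naming convention. The env var names are
# assumptions, not the ones our Launch tool actually set.
def mlflow_names(env=None):
    """Return (experiment_name, run_name) matching the Katib experiment/trial."""
    env = os.environ if env is None else env
    # Fall back to generic names so code still works when run outside Katib.
    experiment = env.get("KATIB_EXPERIMENT_NAME", "adhoc")
    run = env.get("KATIB_TRIAL_NAME", "local-run")
    return experiment, run
```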

<h3 id="tailscale-secure-remote-access">Tailscale: Secure remote access</h3>

<p>In order to create and monitor experiments, users need to have access to Katib and other services running in our cluster. We used <a href="https://tailscale.com">Tailscale</a> for this. It provides network encryption and DNS like a VPN but all the connections are peer-to-peer rather than forcing everything through a single VPN server. It works great and integrated seamlessly with our Google workspace accounts.</p>

<h3 id="launch-user-cli-that-glues-it-all-together">Launch: User CLI that glues it all together</h3>

<p>The only part of this that we wrote ourselves is a tool called Launch. It is <a href="https://github.com/Astera-org/launch">open source</a> and written in Rust. Launch glues everything together. It takes in:</p>

<ul>
  <li>A YAML file specifying the search space and to-be-optimized metrics.</li>
  <li>A --gpus flag, which specifies the number of GPUs per trial.</li>
  <li>A command to run for each trial.</li>
</ul>

<p>And it:</p>

<ul>
  <li>Triggers a build of a container image of the current git branch via Kaniko.</li>
  <li>Constructs a full Katib experiment spec. In addition to the info from the user’s YAML file, it tells Katib to pass hyperparameter values via command line args according to the conventions (described above in the Katib section), and it adds the --tensorboard_dir arg.</li>
  <li>Creates the Katib experiment.</li>
  <li>Prints URLs of Katib and MLflow UI pages for the experiment.</li>
  <li>Polls the cluster to check that the experiment starts and runs a trial successfully.</li>
</ul>
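<p>Launch itself is written in Rust, but the mechanical translation it performs from a hyperparameter assignment to a trial command line can be sketched in a few lines of Python (names here are illustrative):</p>

```python
# Illustrative sketch: turn one hyperparameter assignment into the command
# line for a trial, relying on the --foo convention from the Draccus section.
def trial_command(base_cmd, hyperparams, tensorboard_dir):
    args = list(base_cmd)
    for name, value in hyperparams.items():
        args += [f"--{name}", str(value)]
    # Launch adds this arg so Katib can read the trial's metrics.
    args += ["--tensorboard_dir", tensorboard_dir]
    return args

cmd = trial_command(["python", "train.py"], {"lr": 0.01, "batch_size": 64}, "/metrics/tb")
# → ['python', 'train.py', '--lr', '0.01', '--batch_size', '64',
#    '--tensorboard_dir', '/metrics/tb']
```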

<h4 id="the-journey-6">The journey</h4>

<p>The main question I struggled with was what language to use to implement the tool. Due to the available libraries for interacting with Kubernetes and Katib, Go was the obvious choice. The factors in favor of Rust were pre-existing expertise on the team and me thinking it would be more fun in Rust. The availability of libraries was almost decisive in Go’s favor until we discovered we could use the OpenAPI Generator to generate Rust client libraries for <a href="https://github.com/Astera-org/kubernetes-client-rust">Kubernetes</a> and <a href="https://github.com/Astera-org/katib-client-rust">Katib</a>. Compared to Go, some of the nicest things about Rust are the power of the Serde library for deserializing configuration files and the error handling syntax (writing <code class="language-plaintext highlighter-rouge">foo()?</code> is so much nicer than <code class="language-plaintext highlighter-rouge">if err := foo(); err != nil { return err }</code> ).</p>

<h2 id="system-diagram">System diagram</h2>

<p>Notes:</p>

<ul>
  <li>In reality all of this could be running on a single server, or distributed as shown, or something in between. Kubernetes handles the scheduling dynamically.</li>
  <li>Not shown, but in addition to the depicted MLflow upload, user code also writes metrics to a local directory in Tensorboard format, which is what Katib monitors.</li>
</ul>

<p><img class="wrap" src="/generated/2025-01-27-obelisk-infra-diagram-800-51182cbce.png" alt="System diagram of the experiment infrastructure" srcset="/generated/2025-01-27-obelisk-infra-diagram-400-4868e216e.webp 400w, /generated/2025-01-27-obelisk-infra-diagram-600-4868e216e.webp 600w, /generated/2025-01-27-obelisk-infra-diagram-800-4868e216e.webp 800w, /generated/2025-01-27-obelisk-infra-diagram-1000-4868e216e.webp 1000w" /></p>

<h2 id="what-could-be-improved">What could be improved</h2>

<p>The biggest thing that I wish I could have improved before I left was the latency of building and pushing container images. While Kaniko is supposed to support caching, we weren’t able to get it working, so every time a user launched an experiment Kaniko would take a few minutes to rebuild the entire image (unless they didn’t change any code at all, in which case we would re-use a previously built image). The solution I wanted to try was to build and push the images using Bazel, which has many options for caching and would also allow us to have very fine-grained control over the image to optimize it for build speed. In particular, Bazel should make it possible to have one image layer per Python package in our dependencies, so if a single dependency changes we wouldn’t need to rebuild and push a single huge layer that has all of our dependencies.</p>

<p>Another thing I wanted to do was to modify the Katib UI to allow adding a link from Katib to MLflow. This is hopefully a simple change.</p>

<p>Finally there are things which we didn’t implement only because we didn’t need them, but which I expected to need at some point. These include queuing experiment trials according to a priority (which I planned to implement via Kueue) and multi-machine trials (which I planned to implement via Kubeflow Training Operator).</p>

<h2 id="credits">Credits</h2>

<p>Matthew Behrens and Mick van Gelderen helped a lot with many aspects. Among other things, Matt actually got Talos running, including figuring out how to install it from inside Ubuntu and finding versions of Talos and the Nvidia system extensions that worked with our hardware, and Mick implemented most of the Launch tool, discovered Kaniko and proved it could work.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>This work was done in 2024. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>garymm</name></author><category term="machine learning" /><category term="programming" /><summary type="html"><![CDATA[While working on machine learning research at the Astera Institute1, I led a team that assembled a system that enabled researchers to quickly and easily run experiments that used up to a full datacenter’s worth of GPUs. I intentionally wrote “assemble” rather than “build”, because the system mostly consists of off-the-shelf components. The challenge was in digging through the huge number of options for each possible piece of functionality, selecting appropriately, gluing things together into a working system, and designing an easy but powerful interface. I’m proud of how little code we wrote relative to how much functionality the system provides. This work was done in 2024. &#8617;]]></summary></entry><entry><title type="html">JAX and Equinox: What are they and why should I bother?</title><link href="http://garymm.org/blog/2024/09/08/jaxwhat/" rel="alternate" type="text/html" title="JAX and Equinox: What are they and why should I bother?" /><published>2024-09-08T00:00:00-07:00</published><updated>2024-09-08T00:00:00-07:00</updated><id>http://garymm.org/blog/2024/09/08/jax-equinox-what-and-why</id><content type="html" xml:base="http://garymm.org/blog/2024/09/08/jaxwhat/"><![CDATA[<p>This post is written as a Jupyter notebook which you can run and edit using the link below:</p>

<p><a href="https://githubtocolab.com/garymm-org/garymm-org.github.io/blob/master/assets/jax-equinox-what-and-why.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></p>

<div class="jupyter-notebook" style="position: relative; width: 100%; margin: 0 auto;">
  <div class="jupyter-notebook-iframe-container">
    <iframe src="/assets/jax-equinox-what-and-why.ipynb.html" style="position: absolute; top: 0; left: 0; border-style: none;" width="100%" height="100%" onload="this.parentElement.style.paddingBottom = (this.contentWindow.document.documentElement.scrollHeight + 10) + 'px'"></iframe>
  </div>
</div>]]></content><author><name>garymm</name></author><category term="machine learning" /><category term="programming" /><summary type="html"><![CDATA[This post is written as a Jupyter notebook which you can run and edit using the link below:]]></summary></entry><entry><title type="html">Using Fidelity as a checking account to 10x your yield</title><link href="http://garymm.org/blog/2024/08/31/fidelity/" rel="alternate" type="text/html" title="Using Fidelity as a checking account to 10x your yield" /><published>2024-08-31T00:00:00-07:00</published><updated>2024-08-31T00:00:00-07:00</updated><id>http://garymm.org/blog/2024/08/31/fidelity</id><content type="html" xml:base="http://garymm.org/blog/2024/08/31/fidelity/"><![CDATA[<p>You can use an account at Fidelity as a checking account, meaning you can write and deposit checks and withdraw cash from ATMs.
Why do this?</p>

<h2 id="higher-yield">Higher yield</h2>

<p>At Fidelity you can get much higher yield on your money without sacrificing liquidity. E.g., the Schwab checking account I used before switching to Fidelity currently pays 0.45%. The money in the equivalent account at Fidelity had an annualized yield of 4.96% last week.</p>

<p>To get that higher yield, Fidelity will invest your money in US treasury securities. The actual rate varies depending on the market. There are probably times when the yield will be lower than what you can get in a checking account. For example, in 2021, Fidelity’s <a href="https://fundresearch.fidelity.com/mutual-funds/performance-and-risk/31617H102">Government Money Market Fund earned 0.01%</a> whereas the average <a href="https://ycharts.com/indicators/us_interest_checking_account_rate">checking account was paying 0.03%</a>. At that low end the absolute difference is negligible, but when interest rates rise, the difference is huge (e.g. the current 4.96% vs 0.45%).</p>

<p>Some consider treasuries riskier than an FDIC-insured checking account. Personally I think the odds of losing money in both are very similar. They both involve the US government defaulting on its obligations (treasury debts in one case, FDIC insurance in the other). Fidelity does offer an FDIC-insured investment that is currently paying 2.72%, so ~5x what a checking account pays with the same risk.</p>

<h2 id="account-consolidation">Account consolidation</h2>

<p>You can choose to consolidate several accounts (checking, retirement investments, non-retirement investments, etc) at Fidelity and thus have one fewer financial institution to deal with. I personally have spending money (what I used to have in a checking account), non-retirement investments, a health savings account, and a retirement account there.</p>

<h2 id="how-to-set-it-up">How to set it up</h2>

<p>There are two ways: use a brokerage account or a cash management account.
Here’s how they compare along the main axes I care about:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>brokerage account</th>
      <th>cash management account</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>ATM fees reimbursed</td>
      <td>only if you have &gt;$250k across all Fidelity accounts</td>
      <td>yes</td>
    </tr>
    <tr>
      <td>FDIC insured option</td>
      <td>no</td>
      <td>yes (lower yield)</td>
    </tr>
    <tr>
      <td>use 1 account for checking + investments</td>
      <td>yes</td>
      <td>no</td>
    </tr>
  </tbody>
</table>

<p>When you open your account you’ll need to select what your “cash” holding is actually invested in. Currently the default for both is the Fidelity Government Money Market Fund (SPAXX). Then when you deposit money into your account, it is automatically used to purchase shares of that holding.</p>

<p>The routing number that Fidelity gives you for making direct deposits or withdrawals is actually associated with another institution (since Fidelity is not actually a bank that offers checking accounts, it has to partner with another institution to access the ACH system for transfers), so don’t be alarmed if you enter the routing number and some website says it’s not Fidelity.</p>

<p>There’s lots more details over at the <a href="https://www.bogleheads.org/wiki/Fidelity:_one_stop_shop#Suggested_account_usages">Bogleheads wiki</a>. The one inaccuracy I noticed there is it says that brokerage account ATM fees are reimbursed only for “Private Client Group”, but it’s actually both Private and Premium clients and Premium is a lower threshold (&gt;$250k assets across all your Fidelity accounts, as of writing).</p>]]></content><author><name>garymm</name></author><category term="money" /><summary type="html"><![CDATA[You can use an account at Fidelity as a checking account, meaning you can write and deposit checks and withdraw cash from ATMs. Why do this?]]></summary></entry><entry><title type="html">Kagi vs Google search: a personal evaluation</title><link href="http://garymm.org/blog/2024/08/17/kagigoogle/" rel="alternate" type="text/html" title="Kagi vs Google search: a personal evaluation" /><published>2024-08-17T00:00:00-07:00</published><updated>2024-08-17T00:00:00-07:00</updated><id>http://garymm.org/blog/2024/08/17/kagi-vs-google-search</id><content type="html" xml:base="http://garymm.org/blog/2024/08/17/kagigoogle/"><![CDATA[<p><a href="https://kagi.com">Kagi</a> is a relatively new search engine. Unlike Google, it makes money through user subscriptions and shows no ads.
Despite having decreased my usage of web search since the release of ChatGPT, I still use it a lot, and would be willing to pay a few
bucks a month for a significantly better experience. To evaluate Kagi, I put 75 of my recent search queries into Kagi and Google and rated
which I preferred. The queries spanned various topics, heavily tilted towards software engineering and computer topics.</p>

<h2 id="summary">Summary</h2>

<p>After this experiment I’ve decided to pay for Kagi and set it as my default search engine on both my phone and laptop.</p>

<p>Here’s a qualitative comparison and some thoughts:</p>

<ul>
  <li>When Google shows ads on my phone, it really hurts the experience since it takes up the whole screen (often two screens of scrolling) and the ads are very very rarely relevant (with the exception of Google shopping results, which are sometimes relevant). On desktop the ads are a minor annoyance since I typically can still see the non-ad results without scrolling, and in general when I’m using my laptop I’m in less of a hurry. However maybe only 1/10 of my queries trigger non-Google-shopping ads. Probably because many of my queries are very specific and technical. As noted, Kagi doesn’t show any ads ever.</li>
  <li>Google is better at extracting relevant information (either from web results or structured data like stock prices) and putting it at the top of the search results. For Kagi this information is usually in the pages that are at or near the top, but it takes an extra click to get it. E.g. a graph of a stock’s price.</li>
  <li>Kagi shows more results from somewhat obscure, non-commercial sites and blogs. For some of my queries, these sites had excellent content that I would be very unlikely to find via Google.</li>
  <li>Google has a lot of features that I don’t care about that you might (for example, live sports scores).</li>
  <li>I didn’t thoroughly evaluate queries where I was trying to buy products online. Kagi doesn’t have a shopping search feature, and I expect I will probably continue to use Google shopping in addition to other sites to shop.</li>
</ul>

<p>Having worked at Google on search and seen how much human ingenuity and money went into building it, it’s pretty shocking
that a <a href="https://blog.kagi.com/what-is-next-for-kagi">37 person</a> (as of 2024-04) company can compete at all, but here we are!</p>

<h2 id="detailed-results">Detailed results</h2>

<ul>
  <li>Tie: 47 / 75</li>
  <li>Strongly prefer Google: 3 / 75</li>
  <li>Strongly prefer Kagi: 4 / 75</li>
  <li>Weakly prefer Google: 11 / 75</li>
  <li>Weakly prefer Kagi: 10 / 75</li>
</ul>

<h3 id="google-big-wins">Google big wins</h3>

<ul>
  <li>“bryant controlbox google home”. <a href="https://www.reddit.com/r/smarthome/comments/j32rkz/bryant_evolution_connex_connect_talking_to_other/">This reddit post</a> is the only satisfying result on either, and it’s in the first few results for Google but not for Kagi.</li>
  <li>“piedmont california front setback requirements”. Google has an “AI Overview” with the answer (which appears to have been extracted from a PDF). Kagi’s top result doesn’t have the answer on the page, though it does link to the PDF that contains the answer. It would take at least a minute of careful reading of the page that Kagi returned to figure out which link to click to get the right PDF, and then loading and searching in the PDF might take another minute.</li>
  <li>“intc stock”. Google has a nice interactive graph. Kagi has some data (like current price, 52 week range), but I like the interactive graph more.</li>
</ul>

<h3 id="kagi-big-wins">Kagi big wins</h3>

<ul>
  <li>“lugg movers”. Google starts with several ads for other companies (competitors to Lugg I assume). On my phone, I needed to scroll down two full screens to get past the ads to the actual result I wanted. Kagi had no ads, and had the official Lugg page (which is what I wanted) at the top.</li>
  <li>“josefk simt”. Kagi returned exactly what I wanted, which was <a href="https://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html">this page from “yosefk.com”</a> even though I misspelled the domain name in the query. Google seems to have decided there were not enough relevant results for my whole query so it searched for just “josefk” and then noted next to each of the results that the result is “<code class="language-plaintext highlighter-rouge">Missing: simt</code>”.</li>
  <li>“how to value startup options”. On desktop, Google starts with 4 ads which are totally irrelevant to what I wanted (though on my phone it didn’t show any ads). After that it had an AI summary which seemed reasonable and itself linked to some pretty good results. After that it had some decent web results. Kagi had no ads and had some of the same results as Google, but only Kagi had <a href="https://www.benkuhn.net/optopt/">this gem</a> from Ben Kuhn near the top. Reading that led to lots of other relevant links on that same site.</li>
  <li>“union find algorithm”: Google’s top result is GeeksforGeeks, which has relevant info but is not presented particularly well and the page has a huge amount of annoying animated ads. Google’s second result is to a pretty useful Wikipedia article. Kagi links to Wikipedia first, and second to <a href="https://labuladong.gitbook.io/algo-en/iv.-high-frequency-interview-problem/union-find-explanation">this page</a> which has no ads and has nice illustrations of the algorithm.</li>
</ul>]]></content><author><name>garymm</name></author><category term="internet" /><category term="information" /><category term="computers" /><category term="search" /><summary type="html"><![CDATA[Kagi is a relatively new search engine. Unlike Google, it makes money through user subscriptions and shows no ads. Despite having decreased my usage of web search since the release of ChatGPT, I still use it a lot, and would be willing to pay a few bucks a month for a significantly better experience. To evaluate Kagi, I put 75 of my recent search queries into Kagi and Google and rated which I preferred. The queries spanned various topics, heavily tilted towards software engineering and computer topics.]]></summary></entry><entry><title type="html">Attention, Memory, and Productive Knowledge Work</title><link href="http://garymm.org/blog/2024/06/09/attention-memory-productive-knowledge-work/" rel="alternate" type="text/html" title="Attention, Memory, and Productive Knowledge Work" /><published>2024-06-09T00:00:00-07:00</published><updated>2024-06-09T00:00:00-07:00</updated><id>http://garymm.org/blog/2024/06/09/attention-memory-and-productive-knowledge-work</id><content type="html" xml:base="http://garymm.org/blog/2024/06/09/attention-memory-productive-knowledge-work/"><![CDATA[<p>Here I present some ideas for increasing the productivity of knowledge workers by structuring their workflows around attention and memory.
I wrote this for my own benefit, but I hope you find it useful too!</p>

<h2 id="workflow-matters">Workflow matters</h2>

<p>By “workflow” I mean loosely how execution tasks are scheduled and coordinated. By “execution tasks” I mean the activities which more-or-less-directly create value. For a software engineer, these tasks include programming and designing.</p>

<p>Much of the most influential thinking about optimizing workflows to increase productivity comes from the automobile industry. The history of car manufacturing has several inspiring examples. In 1909, a Ford Model T Runabout sold for $27,977 (in 2024 USD). In 1925 (16 years later), it sold for $4,517 (also in 2024 USD)<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>. The number of Model Ts you could buy per dollar increased by over 6x!</p>

<p>Much of this increasing productivity was due to changes in the workflow. One major change was the introduction of the moving assembly line. Prior to the assembly line, cars were built through “the craft method”, in which teams of fifteen workers worked simultaneously on a single car<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>, which makes me think of young children playing soccer. This was inefficient in many ways. People got in each other’s way, and they had to spend time walking around the factory to go between cars. With a moving assembly line, parts came to the workers and each worker could complete their stage of production without having to walk, coordinate with others, move tools, etc.</p>

<h2 id="attention-and-memory-matter">Attention and memory matter</h2>

<p>In manufacturing, the main inputs were materials, equipment, and manual labor. In knowledge work, the main input is human minds. To increase productivity, we need to produce more output without increasing inputs. One way to do this is to optimize the workflow, and one way to optimize the knowledge-work workflow is to understand some properties of attention and memory.</p>

<p>“Working memory is a cognitive system with a limited capacity that can hold information temporarily.”<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup> It is essential for reasoning and decision making, which are crucial in knowledge work. The set of mental objects you can mentally manipulate at one time is limited by the capacity of your working memory. After switching tasks, it takes time to build up working memory.<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">4</a></sup></p>

<p>Long-term memory is the system that lets you restore previous working memories. “Forgetting” means something being lost from long-term memory. The “forgetting curve”<sup id="fnref:5"><a href="#fn:5" class="footnote" rel="footnote" role="doc-noteref">5</a></sup> is a stylized fact. The longer you go without retrieving a memory, the more likely you are to forget it.</p>

<p><img class="wrap" src="/generated/2024-06-09-attention-memory-productive-knowledge-work-forgetting-curve-659-d8ed2f894.png" alt="the forgetting curve" srcset="/generated/2024-06-09-attention-memory-productive-knowledge-work-forgetting-curve-400-9b57bcb5c.webp 400w, /generated/2024-06-09-attention-memory-productive-knowledge-work-forgetting-curve-600-9b57bcb5c.webp 600w, /generated/2024-06-09-attention-memory-productive-knowledge-work-forgetting-curve-659-9b57bcb5c.webp 659w" /></p>
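The curve above is often modeled as exponential decay of retention with time since the last retrieval. A minimal sketch, assuming a simple exponential form; the `stability` parameter is illustrative, not a value fitted from any study:

```python
import math

def retention(days_since_retrieval, stability=2.0):
    """Stylized forgetting curve: the chance of recalling something
    decays roughly exponentially with time since the last retrieval.
    `stability` (in days) is an illustrative parameter, not a fitted
    value; larger means slower forgetting."""
    return math.exp(-days_since_retrieval / stability)

print(round(retention(1), 2))  # 0.61: most survives a day
print(round(retention(7), 2))  # 0.03: little survives a week untouched
```

The qualitative point is what matters: each retrieval resets the clock, which is why long gaps between work sessions are so costly.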

<p>This conceptualization of human memory is quite similar to how computers work: working memory is analogous to a computer’s volatile memory (e.g., registers), long-term memory is analogous to persistent storage (e.g., a flash drive), and forgetting is analogous to deleting a file.<sup id="fnref:6"><a href="#fn:6" class="footnote" rel="footnote" role="doc-noteref">6</a></sup></p>

<p>If it’s not obvious by now, workflow interacts with how our memories work. Every time one switches tasks, one must repopulate working memory before becoming productive. And extended periods between sessions of work on a task lead to forgetting. One must re-learn, which takes time.</p>

<h2 id="ways-our-workflow-makes-us-less-productive">Ways our workflow makes us less productive</h2>

<p>Yet many knowledge workers switch tasks very often. This has obviously been the case in many places I’ve worked, but there is some objective data to support this impression: A report from the summer of 2018 analyzed data from over fifty thousand active users of the RescueTime time tracking software. It found that the median time between checking communication apps like email and Slack was 6 minutes, and more than 2/3 of the users <em>never</em> experienced an hour of uninterrupted time.<sup id="fnref:7"><a href="#fn:7" class="footnote" rel="footnote" role="doc-noteref">7</a></sup></p>

<p>Besides this short-term switching between execution and collaboration, people often switch between tasks that they are executing. On several projects I worked on it was common to have work items that went unfinished for months, with sporadic bouts of work spaced weeks apart. Work done this way has many costs, but it definitely incurs costs of forgetting and re-learning.</p>

<h2 id="suggestions">Suggestions</h2>

<p>At a high level:</p>

<ul>
  <li>Minimize context switches so as to avoid cost of loading things into working memory.</li>
  <li>Minimize time between sessions of work on a single task so as to avoid forgetting.</li>
  <li>Remember that convenience ≠ productivity.</li>
</ul>

<p>And now some specific ways to put these principles into practice.</p>

<h3 id="use-meetings-well">Use meetings well</h3>

<p>Instant messages, email, and interactions on doc comments are all asynchronous. Each message involves a context switch. Meetings are synchronous, rapid, concentrated communication. My rule of thumb: after the third message in an email or IM conversation, it’s better to switch to a meeting. An illustration of the cost of context switches that can be avoided by a meeting:</p>

<p><img class="wrap" src="/generated/2024-06-09-attention-memory-productive-knowledge-work-side-by-side-653-0d3f63e19.png" alt="two alternative ways to schedule work" srcset="/generated/2024-06-09-attention-memory-productive-knowledge-work-side-by-side-400-e692b153f.webp 400w, /generated/2024-06-09-attention-memory-productive-knowledge-work-side-by-side-600-e692b153f.webp 600w, /generated/2024-06-09-attention-memory-productive-knowledge-work-side-by-side-653-e692b153f.webp 653w" /></p>

<p>Meetings certainly have their own costs, and are often run poorly, but producing fewer context switches is a huge and underappreciated advantage of meetings over asynchronous communication.</p>

<h4 id="meeting-tips">Meeting tips</h4>

<p>Regularly scheduled meetings are useful for regular, non-urgent communication. Participants know they’ll be able to discuss things relatively soon, and therefore can avoid resorting to asynchronous communication. Between meetings, participants can collect agenda items in a document as they arise. This is an example of convenience ≠ productivity: if I think of something to ask my coworker, it’s more convenient for me to IM him, but if it’s not urgent, it’s more productive for me to add it to the agenda of our next regularly scheduled meeting.</p>

<p>For group meetings, it can be efficient to have a structured way for participants to schedule smaller-group follow-up meetings. When I was a manager at Microsoft, my team’s regular sync meetings were 60 minutes, but the whole team was only expected to meet for at most 30 minutes, and the rest of the hour was used for smaller group follow-up meetings. This avoids a context switch between blocks of meetings and blocks of solo work. And it avoids asynchronous back-and-forth to schedule a follow-up meeting.</p>

<p>Finally, there are many ways meetings can be inefficient, but if participants are vigilant and vocal, they can be improved (or cancelled! Not all meetings are worthwhile).</p>

<h3 id="schedule-asynchronous-communication">Schedule asynchronous communication</h3>

<p>By default, don’t leave your inbox open, don’t leave your IM app open, and don’t leave your phone notifications on. Check these things on a schedule that balances responsiveness to others with your own ability to focus. Personally I follow a loose schedule of checking things first thing in the morning, immediately before meetings, and once or twice during the afternoon, when I happen to feel blocked or need a mental break.</p>

<p>I used to have a problem with getting distracted by my inbox every time I sent an email. To send email without checking your inbox, you can use <a href="https://mail.google.com/mail/?fs=1&amp;tf=cm">this link for GMail</a> or <a href="https://outlook.office365.com/mail/0/deeplink/compose">this one for Outlook</a>.</p>

<h3 id="schedule-focused-work">Schedule focused work</h3>

<p>One technique for avoiding self-imposed distraction is called “<a href="https://en.wikipedia.org/wiki/Pomodoro_Technique">Pomodoro</a>”, which basically consists of setting a timer, and taking a break when the timer goes off.</p>

<p>To increase the odds of having large blocks of time to focus, schedule events on your calendar that prevent others from scheduling meetings. If you have the option to work in a place that is quiet and physically isolated, try to do that during your scheduled focus blocks.</p>

<h3 id="limit-the-number-of-in-progress-tasks">Limit the number of in-progress tasks</h3>

<p>Limiting the number of tasks that you have in-progress can help reduce the temptation to context switch (and flush your working memory) and it will reduce the odds that you forget important details about one incomplete task while you’re working on another. This is a key feature of <a href="https://en.wikipedia.org/wiki/Kanban_(development)">Kanban</a> and <a href="https://en.wikipedia.org/wiki/Scrum_(software_development)">Scrum</a>.</p>

<h3 id="use-tools-to-disseminate-commonly-needed-information">Use tools to disseminate commonly needed information</h3>

<p>Some information is so commonly needed that the questions should be anticipated and built into tools that are used as part of the regular workflow. For example “Who is working on this task?” Proper use of an issue tracker (e.g., GitHub Issues, Asana) can answer this without the back-and-forth of asynchronous communication or the time burden of a meeting. If you find there is some question like this that is repeatedly asked, but has fairly formulaic answers, check if there’s a tool that you can adopt that will disseminate that information more efficiently.</p>

<h3 id="speed-up-testing-and-reviews">Speed up testing and reviews</h3>

<p>This is somewhat specific to software development, but it probably has analogs in other professions.</p>

<p>The “testing and reviews” part of the software workflow typically looks like:</p>

<pre>While not approved:
    Author: rebase, (read / think, write, build, run) until ready.
    Wait for automatic checks.
    Reviewer: read / think, comment. Maybe approve.</pre>

<p>This leaves plenty of room for context switching and forgetting:</p>

<pre>While not approved:
    Wait for author to start.  <b>Author and reviewer forget.</b>
    Author: rebase, (read / think, write, build, run) until ready.  <b>Author context switch. Reviewer forgets.</b>
    Wait for max(automatic checks, reviewer to start).  <b>Author and reviewer context switch + forget.</b>
    Reviewer: read / think, comment. Maybe approve.  <b>Reviewer context switch. Author forgets.</b></pre>

<p>What can we do about this? If automatic checks are frequently the bottleneck, spend time speeding them up. If code reviews are the bottleneck, speed those up. On a previous team I set up a duty rotation to review any changes that did not yet have a reviewer. You might also experiment with pair programming, which basically combines code review and programming.</p>

<h2 id="acknowledgements-and-further-reading">Acknowledgements and further reading</h2>

<p>Besides my own experience, this post is based on the following:</p>

<ul>
  <li><em>A World Without Email</em> by Cal Newport.</li>
  <li><em>Deep Work</em> by Cal Newport.</li>
  <li><em>Getting Things Done</em> by David Allen.</li>
  <li><em>More Effective Agile</em> by Steve McConnell.</li>
</ul>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>https://en.wikipedia.org/wiki/Ford_Model_T#Price_and_production <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2">
      <p>A world without email, page 97. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3">
      <p><a href="https://en.wikipedia.org/wiki/Working_memory">https://en.wikipedia.org/wiki/Working_memory</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4">
      <p>https://en.wikipedia.org/wiki/Psychological_refractory_period <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5">
      <p>https://en.wikipedia.org/wiki/Forgetting_curve <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6">
      <p>https://en.wikipedia.org/wiki/Memory_hierarchy <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:7">
      <p>A world without email, page 11 <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>garymm</name></author><category term="work" /><category term="productivity" /><summary type="html"><![CDATA[Here I present some ideas for increasing the productivity of knowledge workers by structuring their workflows around attention and memory. I wrote this for my own benefit, but I hope you find it useful too!]]></summary></entry><entry><title type="html">Why I considered IVF despite not having any fertility issues, and then decided against it</title><link href="http://garymm.org/blog/2024/03/10/why-i-considered-ivf/" rel="alternate" type="text/html" title="Why I considered IVF despite not having any fertility issues, and then decided against it" /><published>2024-03-10T00:00:00-08:00</published><updated>2024-03-10T00:00:00-08:00</updated><id>http://garymm.org/blog/2024/03/10/why-i-considered-ivf</id><content type="html" xml:base="http://garymm.org/blog/2024/03/10/why-i-considered-ivf/"><![CDATA[<p>In <a href="https://www.garymm.org/blog/2023/11/10/the-dangers-of-reproducing-while-old/">The dangers of reproducing while old</a>, I mentioned that pre-implantation genetic testing could be a way for older prospective parents to improve outcomes. This led my partner and me to seriously consider doing IVF despite not having any fertility issues. At a high level, my conclusion after writing that post was:</p>

<ul>
  <li>Older parents are at much higher risk of passing harmful genetic mutations to their embryos.</li>
  <li>IVF gives people the opportunity to screen for these harmful mutations, potentially avoiding miscarriage or serious health conditions later on. It also provides an opportunity to use polygenic screening to select for desirable traits.</li>
  <li>There’s not much evidence that IVF results in worse health outcomes for children.</li>
</ul>

<p>After investigating this more, I now think:</p>

<ul>
  <li>The benefit of polygenic screening is currently generally small, and in our case it would be tiny.</li>
  <li>There’s some evidence of IVF producing worse health outcomes.</li>
</ul>

<h2 id="polygenic-screening">Polygenic screening</h2>

<p>I think polygenic screening<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> has a lot of potential, but currently it has serious limitations, including:</p>

<ol>
  <li>The models for certain traits like intelligence just aren’t very good at predicting. This is called the “missing heritability” problem, and it’s quite controversial exactly what’s going on, but some of the issues are clear. One is that current models are based on data that measured other things, like educational attainment, which are correlated with intelligence, but not perfectly (<a href="https://www.sciencedirect.com/science/article/abs/pii/S0160289606000171">e.g., Deary et al. find 0.81</a>). Another is that some of the variance in traits we care about is probably caused by very rare variants, which would require huge sample sizes to detect.</li>
  <li>The models are much better for people of certain ancestry than others, because of the data that they were created with. My impression is that the models currently work best for people of northern European ancestry.</li>
  <li>The expected variance amongst embryos from the same parents is pretty low. To have good odds of finding an embryo that has polygenic scores much better than the average of its genetic siblings (i.e. the other embryos parents will be choosing amongst<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>), one needs a lot of embryos to choose from.</li>
</ol>

<p>Point 1 should decrease the appeal of polygenic screening for everyone. Point 2 decreased it for me because my children would be of mostly non-northern-European ancestry. As for the number of embryos, my partner took a test that suggested a round of egg retrieval would yield relatively few eggs. This meant she’d probably have to undergo multiple rounds of egg retrieval, which is expensive and unpleasant.</p>

<p>It’s hard to find good information on this topic. It’s not hard to find people saying that the technology doesn’t work, but I get the distinct whiff of motivated reasoning from articles like <a href="https://liorpachter.wordpress.com/2021/04/12/the-amoral-nonsense-of-orchids-embryo-selection/">this one by Lior Pachter</a>. Basically, I think the author finds polygenic screening morally wrong or disgusting, and is therefore finding reasons to say it won’t work.</p>

<p>I want polygenic screening to work. If we were doing IVF anyway, I would definitely have the embryos polygenically screened.</p>

<h2 id="ivf-health-outcomes">IVF health outcomes</h2>

<p>I found the excellent paper <a href="https://academic.oup.com/humupd/article/25/2/137/5316072?login=false">The health of children conceived by ART: ‘the chicken or the egg?’</a>, which looks at studies that try to control for the systematic differences between people who pursue IVF and those who don’t. That review, and newer studies such as <a href="https://pubmed.ncbi.nlm.nih.gov/35934120/">Sutcliffe (2023)</a>, did not find large differences in IVF babies later in life. But the best-controlled studies <em>do</em> find a pretty large increased risk of preterm birth (relative risk somewhere in the range of 1.5-2). <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5791955/">Elsewhere</a> I found the absolute risk of preterm birth for a 36-year-old mother is about 6%, so IVF might take that up to 9-12%. Since preterm birth is associated with a lot of bad health conditions, it would be surprising if IVF children are more likely to be born preterm but are equally healthy later on.</p>
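The back-of-the-envelope above (a ~6% baseline scaled by a relative risk of 1.5-2) can be sketched directly; `scaled_risk` is my own illustrative helper, and the approximation ignores confidence intervals:

```python
def scaled_risk(baseline, relative_risk):
    """Back-of-the-envelope: absolute risk is roughly the baseline
    risk multiplied by the relative risk."""
    return baseline * relative_risk

baseline_preterm = 0.06  # ~6% preterm-birth risk for a 36-year-old mother
low = scaled_risk(baseline_preterm, 1.5)
high = scaled_risk(baseline_preterm, 2.0)
print(f"{low:.0%} - {high:.0%}")  # 9% - 12%
```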

<p>I wonder how to reconcile the elevated preterm birth risk with the generally good outcomes in adults. Some possibilities:</p>

<ul>
  <li>Preterm birth rates are not higher. The studies are just not controlling for something important.</li>
  <li>Preterm birth caused by IVF is not associated with later bad health, but preterm birth caused by other things is.</li>
  <li>There are negative impacts, but the studies of adults have missed them. Some possible reasons why: studies are too small, the negative health outcomes haven’t shown up yet because most IVF babies are too young, studies looked at the wrong metrics, there’s some selection bias such that the least healthy people are less likely to be studied (this one seems quite plausible to me).</li>
</ul>

<p>My current guess is there’s a small but real tendency for people conceived via IVF to be less healthy later in life, though I’m extremely uncertain about the exact aspects of health, the magnitude and the frequency involved.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>If you are not familiar with polygenic screening, I recommend <a href="https://www.lesswrong.com/posts/yT22RcWrxZcXyGjsA/how-to-have-polygenically-screened-children">Gene Smith’s post</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2">
      <p>Unless you’re using donor gametes. In which case you might use multiple donors and compare across non-siblings. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>garymm</name></author><category term="parenting" /><category term="biology" /><category term="health" /><summary type="html"><![CDATA[In The dangers of reproducing while old, I mentioned that pre-implantation genetic testing could be a way for older prospective parents to improve outcomes. This led my partner and I to seriously consider doing IVF despite not having any fertility issues. At a high level, my conclusion after writing that post was:]]></summary></entry><entry><title type="html">The dangers of reproducing while old</title><link href="http://garymm.org/blog/2023/11/10/the-dangers-of-reproducing-while-old/" rel="alternate" type="text/html" title="The dangers of reproducing while old" /><published>2023-11-10T00:00:00-08:00</published><updated>2023-11-10T00:00:00-08:00</updated><id>http://garymm.org/blog/2023/11/10/the-dangers-of-reproducing-while-old</id><content type="html" xml:base="http://garymm.org/blog/2023/11/10/the-dangers-of-reproducing-while-old/"><![CDATA[<p>I had my first child when I was 36 years old, which made me want to understand the risks of having children at different ages. Before looking into this, my impression was that the main biological problems with old-age parenthood had to do with not having the necessary health and vigor to care for young’uns, and I had heard that older women have trouble getting pregnant. While those are real issues, there are many others worthy of consideration.</p>

<p>My read of the evidence is that the risks of miscarriage and serious health problems for children, including autism and birth defects, increase significantly with parental (both paternal and maternal) age. The data I could find for most risks is not very fine-grained and not very precise, but I think this qualitative description matches the data: risks start rising at around 30 years old for both mothers and fathers, rise gradually through about 35 for mothers and 40 for fathers, and then rise sharply after that.</p>

<p>Interestingly, the ages at which things start to go wrong are similar for fathers and mothers, but the mechanisms are different. Sperm cells are produced throughout a man’s life, and each time a new cell is produced, there is a chance of a genetic mutation. Sperm are produced by copying the DNA of other short-lived cells, which are themselves produced in the same way, so mutations accumulate. Women’s egg cells, however, are all present when a woman is born, but over time they accumulate damage.</p>

<p>If this is correct, then there are two ways to reduce these risks: have kids when young, or use frozen gametes from your younger selves. If you’re already in the danger zone and don’t have frozen gametes, pre-implantation genetic testing may be able to screen out embryos that have certain genetic defects, and thus reduce the risk of some bad outcomes.</p>

<p>My advice:</p>

<ul>
  <li>If you want to have kids at some later age, and that later age is &gt;= 35 for a woman or &gt;= 40 for a man, freeze your gametes ASAP.</li>
  <li>If you’re already past those age thresholds and you have the means, consider in-vitro fertilization so you can take advantage of pre-implantation genetic testing.</li>
</ul>

<p>In “the dangers” section below I summarize some evidence on how parental age interacts with various risks. What’s not obvious is the relationship between the different risks. That is, are they mostly independent of each other, or is a child born with e.g., a heart defect much more likely to be autistic? They are not independent. For example, <a href="https://www.nature.com/articles/pr2006181">Eide et al.</a> find a significant correlation between birth defects and intellectual disability. So if you want to know “what are the odds my kid comes out totally healthy”, I think just looking at the highest risk and ignoring the others is reasonable.</p>

<p>If you’re interested in the details supporting the above conclusions, read on.</p>

<h2 id="technical-jargon">Technical jargon</h2>

<p>Skip this if you know these terms.</p>

<h3 id="prevalence">Prevalence</h3>

<p>The prevalence is what fraction of the population has the outcome of interest. Basically:</p>

<p>(number of people with the outcome) / (number of people that were studied).</p>

<h3 id="odds-ratio">Odds ratio</h3>

<p>An odds ratio is the ratio of the odds of the outcome of interest in the condition of interest to the odds in some reference condition. For the data below, the condition is always a particular parental age range, and the reference condition is some other age range that the researchers chose. For example, say we set the reference age to 25, and our outcome of interest is being born with green hair. If a study finds that children of fathers aged 30 have 1/10 odds of being born with green hair, and the children of fathers aged 25 have 1/100 odds of being born with green hair, then the odds ratio for age 30 is (1/10) / (1/100) = 10.</p>
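The green-hair example can be checked in one line; `odds_ratio` is just an illustrative helper name:

```python
def odds_ratio(odds_condition, odds_reference):
    """Ratio of the odds of the outcome under the condition of
    interest to the odds under the reference condition."""
    return odds_condition / odds_reference

# Fathers aged 30: 1/10 odds of green hair; the age-25 reference: 1/100 odds.
print(round(odds_ratio(1 / 10, 1 / 100)))  # 10
```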

<h3 id="95-ci">95% CI</h3>

<p>A 95% CI (confidence interval) is a range of values. Under certain assumptions<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> there is a 95% chance that the true value falls within that range.</p>
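For intuition, here is one common way a 95% CI for a prevalence estimate can be computed: a normal-approximation (Wald) interval. This particular formula and the counts below are my illustration; the studies cited in this post each use their own methods:

```python
import math

def prevalence_ci_95(cases, n):
    """Normal-approximation (Wald) 95% confidence interval for a
    prevalence estimated as cases / n. Illustrative only; real
    studies may use other interval constructions."""
    p = cases / n
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return (p - half_width, p + half_width)

# Hypothetical counts: 480 cases observed among 100,000 people studied.
low, high = prevalence_ci_95(480, 100_000)
print(f"{low:.3%} - {high:.3%}")
```

The interval narrows as the number of people studied grows, which is why the large-cohort studies below can report fairly tight ranges for rare outcomes.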

<h2 id="pre-implantation-genetic-testing">Pre-implantation genetic testing</h2>

<p>Pre-implantation genetic testing is done on embryos that have been fertilized in-vitro before implanting them into a woman (more details <a href="https://www.lesswrong.com/posts/yT22RcWrxZcXyGjsA/how-to-have-polygenically-screened-children#But_how_do_they_even_get_an_embryo_s_DNA_">here</a>). After developing for about 10 days, embryos have enough cells that some can be removed for genetic testing. <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9674466/">Sordia-Hernandez et al.</a> look at the effects of testing for aneuploidy, which is a specific kind of genetic defect that often results in miscarriage or abortion. They find significant benefits for women &gt;= 35 years old, but not for younger women.</p>

<p>Note that for almost everything else, the outcome being tested for is bad, whereas here it is live birth rate, or the odds of a child being born alive after an embryo is transferred into a woman.</p>

<table border="1">
  <tr>
   <td><strong>Mother’s age</strong>
   </td>
   <td><strong>Live birth rate odds ratio 95% CI (relative to no genetic testing)</strong>
   </td>
  </tr>
  <tr>
   <td>&lt; 35
   </td>
   <td>0.56, 1.34
   </td>
  </tr>
  <tr>
   <td>&gt;= 35
   </td>
   <td>1.07, 2.84
   </td>
  </tr>
</table>

<p>Very recently, some companies have started offering more in-depth genetic screening for embryos, such as assessing risk for polygenic traits, i.e., traits influenced by many genes. The companies offering this service claim all sorts of benefits, such as reducing the risk of cancer and diabetes, but I don’t think it’s been independently evaluated, and it’s probably too new to truly evaluate, since there’s a very small number of people alive who were screened in this way. <a href="https://www.lesswrong.com/posts/yT22RcWrxZcXyGjsA/how-to-have-polygenically-screened-children">Here’s Gene Smith’s post</a> that’s very enthusiastic about such screening and tells you how to go about it, and <a href="https://www.lesswrong.com/posts/yT22RcWrxZcXyGjsA/how-to-have-polygenically-screened-children?commentId=uiFXXRpdXCzXjmfj8">my response trying to summarize a skeptical position</a>.</p>

<p>So if you’re older and you don’t have frozen gametes, should you do IVF just so you can do pre-implantation genetic testing?</p>

<p>Pros:</p>

<ul>
  <li>Very effective at detecting aneuploidy, and thus increasing live birth rate per pregnancy.</li>
  <li>You can choose the child’s sex.</li>
  <li>If you opt for polygenic screening, it is possible to reduce other health risks and possibly improve other desirable traits like IQ. Again, see <a href="https://www.lesswrong.com/posts/yT22RcWrxZcXyGjsA/how-to-have-polygenically-screened-children">Gene Smith’s post</a> for more details on this.</li>
  <li>I haven’t seen any strong evidence that IVF results in worse health outcomes. Note there are many studies that show worse outcomes for IVF, but IVF is largely used by people who have fertility problems, and the differences seem to disappear entirely when controlling for this. <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3650450/">More here</a>.</li>
</ul>

<p>Cons:</p>

<ul>
  <li>Expensive in time and money (maybe $25,000 in 2023).</li>
  <li>Will be more expensive and / or less effective for women who produce fewer eggs per retrieval, which is mostly older women.</li>
  <li>There is <a href="https://academic.oup.com/humupd/article/25/2/137/5316072?login=false">some evidence</a> that IVF results in differences in the embryo that might possibly result in less healthy people (vs old-fashioned conception). I think the odds that this results in worse outcomes are quite low, but it’s worth mentioning.</li>
</ul>

<h2 id="the-dangers">The dangers</h2>

<h3 id="miscarriage">Miscarriage</h3>

<p>This chart from <a href="https://www.bmj.com/content/364/bmj.l869">Magnus et al.</a> shows the absolute risk by mother’s age. Y-axis is the proportion of pregnancies that end in miscarriage:</p>

<p><img class="wrap" src="/generated/2023-11-10-the-dangers-of-reproducing-while-old/maternal-miscarriage-780-b27867d53.jpg" alt="absolute risk of miscarriage by maternal age" srcset="/generated/2023-11-10-the-dangers-of-reproducing-while-old/maternal-miscarriage-400-902027773.webp 400w, /generated/2023-11-10-the-dangers-of-reproducing-while-old/maternal-miscarriage-600-902027773.webp 600w, /generated/2023-11-10-the-dangers-of-reproducing-while-old/maternal-miscarriage-780-902027773.webp 780w" /></p>

<p>And here’s some data from <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7456349/">du Fossé et al.</a> on the risk by father’s age. For the absolute risk, I assumed the absolute risk for the reference age is 10%, which seems to be about the value for a 27 year old woman from the chart above.</p>
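The absolute-risk column in the table below is just the assumed 10% reference risk scaled by the odds-ratio CI bounds (treating the odds ratio as roughly a relative risk, a reasonable approximation for an outcome this rare). A sketch:

```python
reference_risk = 0.10  # assumed absolute miscarriage risk at the reference age

# Odds-ratio 95% CI bounds by father's age, from du Fossé et al.
or_ci = {
    "30-34": (0.90, 1.21),
    "35-39": (0.92, 1.43),
    "40-44": (1.06, 1.43),
    ">= 45": (1.13, 1.81),
}

for age, (low, high) in or_ci.items():
    print(f"{age}: {reference_risk * low:.1%} - {reference_risk * high:.1%}")
```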

<table border="1">
  <tr>
   <td><strong>Father’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
   <td><strong>Absolute risk 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>25-29
   </td>
   <td>reference
   </td>
   <td>10%
   </td>
  </tr>
  <tr>
   <td>30-34
   </td>
   <td>0.9, 1.21
   </td>
   <td>9%, 12.1%
   </td>
  </tr>
  <tr>
   <td>35-39
   </td>
   <td>0.92, 1.43
   </td>
   <td>9.2%, 14.3%
   </td>
  </tr>
  <tr>
   <td>40-44
   </td>
   <td>1.06, 1.43
   </td>
   <td>10.6%, 14.3%
   </td>
  </tr>
  <tr>
   <td>&gt;= 45
   </td>
   <td>1.13, 1.81
   </td>
   <td>11.3%, 18.1%
   </td>
  </tr>
</table>

<h3 id="autism">Autism</h3>

<h4 id="prevalence-1">Prevalence</h4>

<p>People with a huge range of abilities and tendencies are all diagnosed with autism, and there’s a lot of debate about the accuracy of many diagnoses. However “profound autism” is a diagnosis with much clearer criteria. <a href="https://www.researchgate.net/publication/370128310_The_Prevalence_and_Characteristics_of_Children_With_Profound_Autism_15_Sites_United_States_2000-2016">Hughes et al.</a> defined profound autism as “children with autism who were either nonverbal or minimally verbal or had an IQ (intelligence quotient) &lt;50”. That study estimated the prevalence of profound autism in the USA as:</p>

<table border="1">
  <tr>
   <td>Female
   </td>
   <td>1.88 / 1000 = 1 / 532
   </td>
  </tr>
  <tr>
   <td>Male
   </td>
   <td>7.18 / 1000 = 1 / 139
   </td>
  </tr>
  <tr>
   <td>Overall
   </td>
   <td>4.59 / 1000 = 1 / 218
   </td>
  </tr>
</table>

<p>These numbers seem shockingly high, but they do somewhat match my casual observations. I don’t know a lot of children, but I know of at least 2 profoundly autistic boys.</p>

<h4 id="risk-by-parental-age">Risk by parental age</h4>

<p>The studies I found on the impact of parental age did not restrict themselves to just profound autism, so it’s possible that parental age interacts with profound autism differently, but my guess is it’s at least qualitatively correct.</p>

<p>Here are the results from <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2638544/">Durkin et al.</a>, who looked at both the father’s and the mother’s ages. For the absolute risk, I divided the number of autism spectrum disorder cases by the size of the “Birth Cohort Comparison Group” for the father’s or mother’s reference age, extracted from table 3 of the paper. For fathers that’s 322 / 67,080 = 0.48%; for mothers that’s 366 / 75,053 = 0.49%. These numbers are close to the overall risk of profound autism from Hughes et al. above, but this study considered any autism diagnosis, so something is probably wrong either with my calculation or with one or both of these studies.</p>
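<p>The arithmetic behind those reference risks and the absolute-risk columns below, as a quick sketch (counts are the ones quoted above from table 3 of Durkin et al.; scaling OR bounds by the reference risk is the same approximation as before):</p>

```python
# Reference-age absolute risks: ASD cases / birth cohort comparison group size.
father_ref_risk = 322 / 67_080   # fathers aged 25-29
mother_ref_risk = 366 / 75_053   # mothers aged 25-29
print(f"fathers: {father_ref_risk:.2%}, mothers: {mother_ref_risk:.2%}")
# fathers: 0.48%, mothers: 0.49%

# The absolute-risk columns scale the reference risk by the OR bounds,
# e.g. fathers >= 40 with an OR CI of 1.1-1.8:
print(f"{1.1 * father_ref_risk:.2%}, {1.8 * father_ref_risk:.2%}")
# 0.53%, 0.86%
```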

<table border="1">
  <tr>
   <td><strong>Father’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
   <td><strong>Absolute risk 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>&lt;20
   </td>
   <td>0.4, 1.0
   </td>
   <td>0.19%, 0.48%
   </td>
  </tr>
  <tr>
   <td>20-24
   </td>
   <td>0.7, 1.1
   </td>
   <td>0.34%, 0.53%
   </td>
  </tr>
  <tr>
   <td>25-29
   </td>
   <td>Reference
   </td>
   <td>0.48%
   </td>
  </tr>
  <tr>
   <td>30-34
   </td>
   <td>0.9, 1.2
   </td>
   <td>0.43%, 0.58%
   </td>
  </tr>
  <tr>
   <td>35-39
   </td>
   <td>0.9, 1.3
   </td>
   <td>0.43%, 0.62%
   </td>
  </tr>
  <tr>
   <td>&gt;= 40
   </td>
   <td>1.1, 1.8
   </td>
   <td>0.53%, 0.86%
   </td>
  </tr>
</table>

<table border="1">
  <tr>
   <td><strong>Mother’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
   <td><strong>Absolute risk 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>&lt;20
   </td>
   <td>0.5, 1.0
   </td>
   <td>0.25%, 0.49%
   </td>
  </tr>
  <tr>
   <td>20-24
   </td>
   <td>0.8, 1.1
   </td>
   <td>0.39%, 0.54%
   </td>
  </tr>
  <tr>
   <td>25-29
   </td>
   <td>Reference
   </td>
   <td>0.49%
   </td>
  </tr>
  <tr>
   <td>30-34
   </td>
   <td>0.9, 1.3
   </td>
   <td>0.44%, 0.64%
   </td>
  </tr>
  <tr>
   <td>&gt;= 35
   </td>
   <td>1.1, 1.6
   </td>
   <td>0.54%, 0.64%
   </td>
  </tr>
</table>

<p>Note from paper: “Because the increased risk was similar for ages 35–39 and ≥40 years, the high-risk maternal age category was defined as ≥35 years.”</p>

<p>And here are results from <a href="https://doi.org/10.1038/mp.2010.121">another study</a> that looked only at the father’s age:</p>

<table border="1">
  <tr>
   <td><strong>Father’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>15-29
   </td>
   <td>Reference
   </td>
  </tr>
  <tr>
   <td>30-39
   </td>
   <td>1.0, 1.42
   </td>
  </tr>
  <tr>
   <td>40-49
   </td>
   <td>1.07, 1.87
   </td>
  </tr>
  <tr>
   <td>&gt;= 50
   </td>
   <td>1.26, 3.88
   </td>
  </tr>
</table>

<p>This chart shows absolute risks from that same study:</p>

<p><img class="wrap" src="/generated/2023-11-10-the-dangers-of-reproducing-while-old/paternal-autism-571-ad78da241.png" alt="absolute risk of autism by paternal age" srcset="/generated/2023-11-10-the-dangers-of-reproducing-while-old/paternal-autism-400-5c21b43d3.webp 400w, /generated/2023-11-10-the-dangers-of-reproducing-while-old/paternal-autism-571-5c21b43d3.webp 571w" /></p>

<h3 id="chromosome-disorders">Chromosome disorders</h3>

<h4 id="prevalence-2">Prevalence</h4>

<p><a href="https://pubmed.ncbi.nlm.nih.gov/9934980/">Caron, Tihy, and Dallaire</a> find that among mothers aged &gt;= 35, 1.79% (about 1 / 55) of pregnancies have a chromosomal disorder in the second trimester. Note that some chromosome disorders result in miscarriage earlier than that, so the true prevalence is certainly higher.</p>

<h4 id="risk-by-parental-age-1">Risk by parental age</h4>

<p>To compute absolute risk, I took the prevalence number from above and divided it by 5.66 (the midpoint of the odds ratio CI for mothers aged &gt;= 35) to get 1.79% / 5.66 = 0.32%.</p>
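<p>That back-calculation in one step (prevalence from Caron, Tihy, and Dallaire; the OR CI of 5.13–6.2 for mothers &gt;= 35 from Ahn et al., with its midpoint again treated as an approximate relative risk):</p>

```python
prevalence_35_plus = 0.0179        # 1.79% second-trimester prevalence, mothers >= 35
or_midpoint = (5.13 + 6.2) / 2     # midpoint of the OR CI, ~5.66
reference_risk = prevalence_35_plus / or_midpoint
print(f"{reference_risk:.2%}")     # 0.32%
```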

<table border="1">
  <tr>
   <td><strong>Father’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
   <td><strong>Absolute risk 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>&lt;20
   </td>
   <td>1.01, 1.89
   </td>
   <td>0.32%, 0.60%
   </td>
  </tr>
  <tr>
   <td>25-29
   </td>
   <td>reference
   </td>
   <td>0.32%
   </td>
  </tr>
  <tr>
   <td>&gt;= 40
   </td>
   <td>1.12, 1.52
   </td>
   <td>0.36%, 0.49%
   </td>
  </tr>
</table>

<p>From <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7803514/">Fang et al.</a></p>

<table border="1">
  <tr>
   <td><strong>Mother’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
   <td><strong>Absolute risk 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>&lt;20
   </td>
   <td>0.54, 0.88
   </td>
   <td>0.17%, 0.28%
   </td>
  </tr>
  <tr>
   <td>20-34
   </td>
   <td>reference
   </td>
   <td>0.32%
   </td>
  </tr>
  <tr>
   <td>&gt;= 35
   </td>
   <td>5.13, 6.2
   </td>
   <td>1.64%, 1.98%
   </td>
  </tr>
</table>

<p>From <a href="https://obgyn.onlinelibrary.wiley.com/doi/10.1111/aogs.14339">Ahn et al.</a></p>

<h3 id="urogenital-defects">Urogenital defects</h3>

<h4 id="prevalence-3">Prevalence</h4>

<p>1.60 / 1000 = 1 / 625. <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6472003/">Source</a>.</p>

<h4 id="risk-by-parental-age-2">Risk by parental age</h4>

<p>I didn’t find an easy way to calculate absolute risk.</p>

<table border="1">
  <tr>
   <td><strong>Father’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>&lt;20
   </td>
   <td>1.03, 2.19
   </td>
  </tr>
  <tr>
   <td>25-29
   </td>
   <td>reference
   </td>
  </tr>
  <tr>
   <td>&gt;= 40
   </td>
   <td>1.07, 1.52
   </td>
  </tr>
</table>

<p>From <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7803514/">Fang et al.</a></p>

<table border="1">
  <tr>
   <td><strong>Mother’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>20-34
   </td>
   <td>reference
   </td>
  </tr>
  <tr>
   <td>&gt;= 35
   </td>
   <td>1.13, 1.89
   </td>
  </tr>
</table>

<p>From <a href="https://obgyn.onlinelibrary.wiley.com/doi/10.1111/aogs.14339">Ahn et al.</a></p>

<h3 id="heart-defects">Heart defects</h3>

<h4 id="prevalence-4">Prevalence</h4>

<p>137.1 / 10,000 = 1 / 73. <a href="https://www.sciencedirect.com/science/article/pii/S0002870314004980">Source</a>.</p>

<p>Note: this seems really high to me. Maybe most of these are not very serious, or maybe I know people who were born with heart defects but I don’t know they have them.</p>

<h4 id="risk-by-parental-age-3">Risk by parental age</h4>

<p>I didn’t find an easy way to calculate absolute risk.</p>

<table border="1">
  <tr>
   <td><strong>Father’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>&lt;20
   </td>
   <td>0.96, 1.16
   </td>
  </tr>
  <tr>
   <td>25-29
   </td>
   <td>reference
   </td>
  </tr>
  <tr>
   <td>&gt;= 40
   </td>
   <td>1.01, 1.2
   </td>
  </tr>
</table>

<p>From <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7803514/">Fang et al.</a></p>

<table border="1">
  <tr>
   <td><strong>Mother’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>&lt;20
   </td>
   <td>0.79, 1.1
   </td>
  </tr>
  <tr>
   <td>20-34
   </td>
   <td>reference
   </td>
  </tr>
  <tr>
   <td>&gt;= 35
   </td>
   <td>1.06, 1.24
   </td>
  </tr>
</table>

<p>From <a href="https://obgyn.onlinelibrary.wiley.com/doi/10.1111/aogs.14339">Ahn et al.</a></p>

<!-- Footnotes themselves at the bottom. -->
<h2 id="notes">Notes</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">

      <p>Which scientists sometimes violate and thus invalidate their own results, but for now I’m just assuming these stats are sound. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>garymm</name></author><category term="parenting" /><category term="biology" /><category term="health" /><summary type="html"><![CDATA[I had my first child when I was 36 years old, which made me want to understand the risks of having children at different ages. Before looking into this, my impression was that the main biological problems with old-age parenthood had to do with not having the necessary health and vigor to care for young’uns, and I had heard that older women have trouble getting pregnant. While those are real issues, there are many others worthy of consideration.]]></summary></entry></feed>