<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="http://garymm.org/feed.xml" rel="self" type="application/atom+xml" /><link href="http://garymm.org/" rel="alternate" type="text/html" /><updated>2026-02-06T20:43:24-08:00</updated><id>http://garymm.org/feed.xml</id><entry><title type="html">Testing software in the era of coding agents</title><link href="http://garymm.org/blog/2026/02/06/testing-software-coding-agents/" rel="alternate" type="text/html" title="Testing software in the era of coding agents" /><published>2026-02-06T00:00:00-08:00</published><updated>2026-02-06T00:00:00-08:00</updated><id>http://garymm.org/blog/2026/02/06/testing-software-coding-agents</id><content type="html" xml:base="http://garymm.org/blog/2026/02/06/testing-software-coding-agents/"><![CDATA[<p>What parts of my software should be tested? And how? And how do coding agents (e.g. Claude Code) change things? This is my attempt to succinctly explain how I think about these questions at the beginning of 2026 (software development is changing so quickly I feel compelled to note the date. This may all be obsolete soon).</p>

<h2 id="static-analysis--testing">Static analysis ⊂ testing</h2>

<p>For the purposes of this discussion, I’m going to include static analysis in the term “testing” for brevity rather than writing “testing and static analysis”. Static analysis includes any checks that don’t have to execute the code, including the checks performed by compilers and linters.</p>

<h2 id="why-test-software">Why test software</h2>

<p>If it’s not tested, you have much less evidence that it works, where “it” is any particular combination of code, inputs, runtime environment, and assertions. So to the extent you care about the software working, you should test it.</p>

<p>Most software is tested manually when it’s first created. What’s wrong with manual testing?
First, it’s slow! While the time cost of all testing grows with the number of changes being tested and the number of checks, the slope of the curve matters. Automated tests have a higher up-front cost of writing the test, but then a drastically lower marginal cost each time they’re run. The cheaper and faster the tests are, the faster one can make changes (while maintaining the same confidence in correctness).</p>
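<p>To make the slope argument concrete, here is a toy break-even calculation. All the costs are made up for illustration; the point is the shape, not the numbers:</p>

```python
# Toy break-even model: when does writing an automated test beat
# re-running a manual check? All numbers are made up for illustration.
manual_minutes_per_run = 10      # assumed cost of one manual check
automated_upfront_minutes = 60   # assumed cost of writing the test
automated_minutes_per_run = 0.1  # assumed cost of one automated run

def total_cost(runs, upfront, per_run):
    return upfront + runs * per_run

# First run count at which automation is cheaper overall.
break_even = next(
    n for n in range(1, 1000)
    if total_cost(n, automated_upfront_minutes, automated_minutes_per_run)
    < total_cost(n, 0, manual_minutes_per_run)
)
print(break_even)  # 7 with these numbers
```

<p>With these numbers the test pays for itself by the seventh run; once the change rate is high, the marginal cost dominates and the up-front cost is noise.</p>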

<p>The above was true before coding agents, but with coding agents, there are a few additional important considerations. First, the human time needed to actually write code is falling quickly (because you can have coding agents do it), but if checking correctness is slow (e.g. requires manual testing), that will become the bottleneck. Second, coding agents have a much higher chance of success if they can get quick feedback while they work. Third, manual tests often require more interpretation to determine if the system is working as intended, which is error-prone. This is especially true for coding agents that lack broader context and are somewhat prone to <a href="https://en.wikipedia.org/wiki/Reward_hacking">reward hacking</a>. Agents are much less likely to misinterpret an assertion failure than something less well defined, and reward hacking is much more visible if it shows up as disabling an assertion in code rather than an optimistic interpretation of a manual test result.</p>

<h2 id="how-much-to-invest-in-testing">How much to invest in testing</h2>

<p>There are two main things that determine how much it makes sense to invest in testing. First, how much does it matter that the system is correct (or performant)? At one extreme we have code that only needs to work well enough to explore an idea (e.g., for learning about an algorithm). Investing a lot in automated testing for this type of thing is wasteful. At the other extreme is code that is doing critical work everywhere for everyone (e.g., OpenSSL).</p>

<p>Second, how often are the code, its inputs, or its environment changing? If the rate of change is low, the total cost of manual testing is low. If the code is changed frequently (including changes to its dependencies), manual testing is wasteful. As software systems get larger, they typically accrue more interdependencies between modules, which greatly amplifies the effective change rate.</p>

<h2 id="how-to-invest-your-testing-budget">How to invest your testing budget</h2>

<p>Build a test pyramid! The names of the layers matter less than the shape: in general you should have more isolated, fast tests, and fewer slow, integrated tests.</p>

<p><img class="wrap" src="/generated/2026-02-06-testing-software-coding-agents/test-pyramid-466-945f290ca.png" alt="Test pyramid diagram showing more unit tests at the bottom, fewer integration tests in the middle, and even fewer end-to-end tests at the top" srcset="/generated/2026-02-06-testing-software-coding-agents/test-pyramid-400-89c7e4ef1.webp 400w, /generated/2026-02-06-testing-software-coding-agents/test-pyramid-466-89c7e4ef1.webp 466w" /></p>

<p>Quoting “<a href="https://martinfowler.com/articles/practical-test-pyramid.html">The Practical Test Pyramid</a>”:
“Your best bet is to remember two things from Cohn’s original test pyramid:</p>

<ol>
  <li>Write tests with different granularity</li>
  <li>The more high-level you get the fewer tests you should have</li>
</ol>

<p>Stick to the pyramid shape to come up with a healthy, fast and maintainable test suite: Write lots of small and fast unit tests. Write some more coarse-grained tests and very few high-level tests that test your application from end to end.”</p>

<p>Again I think this becomes even more important with coding agents. They may have less understanding of which properties of the system are important, and they will be much more productive if they can get feedback from static analysis tools or unit tests that are fast, specific, and not flaky.</p>

<h2 id="how-to-integrate-testing-into-the-development-process">How to integrate testing into the development process</h2>

<h3 id="when-fixing-bugs">When fixing bugs</h3>

<p>A bug or performance problem is very strong evidence that existing testing is insufficient! If you’re fixing a bug or performance regression, this is a great time to practice test-driven development by following three steps:</p>

<ol>
  <li>Write test. Verify it fails.</li>
  <li>Fix bug.</li>
  <li>Run test. Verify it passes.</li>
</ol>

<p>This guarantees that you’ve actually understood and fixed the problem, and can help avoid regressions.</p>
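<p>A minimal sketch of those three steps (the function and its bug are hypothetical):</p>

```python
# Hypothetical bug: median() used to crash or return the wrong value
# for even-length inputs.
def median(xs):
    xs = sorted(xs)
    n = len(xs)
    if n % 2:
        return xs[n // 2]
    return (xs[n // 2 - 1] + xs[n // 2]) / 2  # step 2: the fix

def test_median_even_length():
    # Step 1: this assertion failed before the fix existed.
    # Step 3: after the fix, the same test passes.
    assert median([4, 1, 3, 2]) == 2.5

test_median_even_length()
```

<p>The test now guards the fix forever at near-zero marginal cost.</p>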

<p>However, people frequently find it’s quite difficult to write tests that reproduce bugs or performance regressions. The code often needs to be refactored to make it easier to test. This brings us to…</p>

<h3 id="when-writing-new-code">When writing new code</h3>

<p>If you’re writing new code that you expect to warrant testing (i.e. you care enough about its correctness or it’s going to change a lot), add tests from the beginning. This will naturally encourage you (or your AI agent) to design it in a way that makes it easy to test! You might find the “<a href="https://testing.googleblog.com/2025/10/simplify-your-code-functional-core.html">functional core, imperative shell</a>” pattern useful here.</p>
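<p>A tiny sketch of that pattern (the names are illustrative, not taken from the linked post): the pure core is trivially unit-testable, and the shell that does the I/O stays thin.</p>

```python
def summarize(lines):
    """Functional core: pure, so a unit test needs no filesystem."""
    words = sum(len(line.split()) for line in lines)
    return f"{len(lines)} lines, {words} words"

def main(path):
    """Imperative shell: does the I/O, delegates the logic."""
    with open(path) as f:
        print(summarize(f.readlines()))

# The core is tested directly, with no setup or teardown:
assert summarize(["a b", "c"]) == "2 lines, 3 words"
```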

<h3 id="when-refactoring">When refactoring</h3>

<p>If you’re making a big change to the system that <em>shouldn’t</em> affect its output but you’re nervous and you feel like you should probably do a lot of manual testing to make sure, that’s a good sign that more automated tests may be needed.</p>]]></content><author><name>garymm</name></author><category term="programming" /><category term="software engineering" /><summary type="html"><![CDATA[What parts of my software should be tested? And how? And how do coding agents (e.g. Claude Code) change things? This is my attempt to succinctly explain how I think about these questions at the beginning of 2026 (software development is changing so quickly I feel compelled to note the date. This may all be obsolete soon).]]></summary></entry><entry><title type="html">Earl: a framework for scalable reinforcement learning research</title><link href="http://garymm.org/blog/2025/03/03/earl/" rel="alternate" type="text/html" title="Earl: a framework for scalable reinforcement learning research" /><published>2025-03-03T00:00:00-08:00</published><updated>2025-03-03T00:00:00-08:00</updated><id>http://garymm.org/blog/2025/03/03/earl</id><content type="html" xml:base="http://garymm.org/blog/2025/03/03/earl/"><![CDATA[<p>In this post I will briefly describe <a href="https://github.com/garymm/earl">Earl</a>, a reinforcement learning (RL) framework I wrote that enables scalable distributed training across multiple devices, and discuss some of the things I learned along the way.</p>

<p>Earl implements the two architectures described in “<a href="https://arxiv.org/abs/2104.06272">Podracer architectures for scalable Reinforcement Learning</a>”, which were used at DeepMind to scale training to very large batch sizes across many chips. Note these are not neural network architectures, but distributed RL architectures that can be used to train models that internally may use any neural network architecture. To prove it is usable, I used Earl to <a href="https://github.com/garymm/earl/tree/master/earl/agents/r2d2">implement the R2D2</a> algorithm as described in another DeepMind paper “<a href="https://openreview.net/forum?id=r1lyTjAqYX">Recurrent Experience Replay In Distributed Reinforcement Learning</a>”.</p>

<h2 id="background">Background</h2>

<p>To provide context, I’ll briefly summarize the Podracer architectures paper. If you know it, feel free to skip this.</p>

<p>In contrast to other machine learning paradigms, online RL involves an agent and an environment, and the training data is generated on the fly from their interactions. The paper describes two architectures: Anakin and Sebulba<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>. Anakin is used when an environment is compatible with jax.jit, and Sebulba is used otherwise. In both of the architectures, the agent is implemented in JAX and is compatible with jax.jit. If you’re not familiar with JAX, see <a href="https://www.garymm.org/blog/2024/09/08/jaxwhat/">my introduction</a>. Basically, code that is run under jax.jit is optimized by a compiler and run on any supported device (e.g. GPU) without further involving the Python interpreter.</p>

<p>In Anakin, one can have the entire training loop (agent + environment interaction, loss function and optimization) happen under jax.jit and thus run on a device (e.g. GPU) without going back to the Python interpreter. In terms of writing a performant training loop, in some ways this is even easier to deal with than normal (supervised) machine learning since one does not need to copy any data from the host to the accelerator. Scaling this across multiple devices is trivial using JAX.</p>
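<p>As a toy illustration of what “the entire training loop under jax.jit” means (this is not Earl’s API; the environment, policy, and update rule are all stand-ins), the whole act/step/update cycle can live inside one jitted lax.scan:</p>

```python
from functools import partial

import jax
import jax.numpy as jnp

def env_step(env_state, action):
    # Stand-in environment dynamics and reward.
    new_state = env_state + action
    return new_state, -jnp.abs(new_state)

@partial(jax.jit, static_argnames="steps")
def train_cycle(params, env_state, steps=100):
    def body(carry, _):
        params, env_state = carry
        action = jnp.tanh(params * env_state)  # stand-in policy
        env_state, reward = env_step(env_state, action)
        params = params + 0.01 * reward        # stand-in "update"
        return (params, env_state), reward

    (params, env_state), rewards = jax.lax.scan(
        body, (params, env_state), None, length=steps)
    return params, env_state, rewards

params, env_state, rewards = train_cycle(jnp.float32(0.1), jnp.float32(1.0))
```

<p>Nothing inside the loop returns to the Python interpreter, so the compiler can keep all data and computation on the device for the whole cycle.</p>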

<p>Below is a figure I created that is analogous to the paper’s figure 3 (see below) that illustrates Anakin-style RL. Notice that there is nothing running on the CPU! This figure may be misleading because the arrows don’t necessarily signify data being copied. It’s just data that is produced by one function being an argument to another function.</p>

<p><img class="wrap" src="/generated/2025-03-03-earl/anakin-800-16e671698.png" alt="anakin RL training architecture diagram" srcset="/generated/2025-03-03-earl/anakin-400-d1f7d320f.webp 400w, /generated/2025-03-03-earl/anakin-600-d1f7d320f.webp 600w, /generated/2025-03-03-earl/anakin-800-d1f7d320f.webp 800w, /generated/2025-03-03-earl/anakin-806-d1f7d320f.webp 806w" /></p>

<p>Although jax.jit-compatible environments are gaining more adoption in research, there are still many environments that can’t run under jax.jit that researchers care about. The Podracer solution to training on these at scale is called Sebulba. Sebulba involves splitting agents into actor and learner as shown in Figure 3 from the paper. Note that in this figure the arrows <em>do</em> signify data copies.</p>

<p><img class="wrap" src="/generated/2025-03-03-earl/sebulba-800-1b8f2b66e.png" alt="sebulba RL training architecture diagram" srcset="/generated/2025-03-03-earl/sebulba-400-edc91e3d3.webp 400w, /generated/2025-03-03-earl/sebulba-600-edc91e3d3.webp 600w, /generated/2025-03-03-earl/sebulba-800-edc91e3d3.webp 800w, /generated/2025-03-03-earl/sebulba-1000-edc91e3d3.webp 1000w" /></p>

<p>One of my main goals with Earl was to have a single agent implementation that could easily be run in either architecture. This is in contrast to what the team behind the Podracers paper did; the paper suggests they implemented agents twice, once for Anakin and once for Sebulba.</p>

<p>The “pod” in “Podracers” is an allusion to a collection of TPUs that are all connected with high bandwidth. Later in this post I will discuss what advantages TPUs actually provide.</p>

<h2 id="gymnax-loop-earls-anakin">Gymnax Loop: Earl’s Anakin</h2>

<p>Earl’s implementation of Anakin is GymnaxLoop. <a href="https://github.com/RobertTLange/gymnax">Gymnax</a> is a collection of RL environments implemented in JAX with a common interface, and Earl adopted that interface because it seemed more widely used than the alternatives. The GymnaxLoop implementation is mostly straightforward, so here I only discuss some of the trickier problems I solved.</p>

<h3 id="avoiding-recompilation">Avoiding recompilation</h3>

<p>The first time a jax.jit function is run, it is compiled, which is slow. Unwanted recompilations are a performance foot-gun, so by default Earl will fail if the code is recompiled. When I enabled this failure, I learned that my Gymnax Loop was recompiling the main loop (act and learn) because the types of some part of the environment state were changing. After digging into it (made difficult due to a <a href="https://github.com/jax-ml/jax/issues/23302">JAX bug</a> that I reported) I discovered that the change was that the initial state (returned by env.reset()) had weak_type=True on some arrays, but calls to env.step() changed the weak_type to False. GymnaxLoop fixes this by setting weak_type=False on all arrays in the environment state before running. This avoids recompilation and thus speeds up training significantly.</p>
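<p>A small sketch of the weak_type distinction (illustrative, not GymnaxLoop’s actual code): arrays built from bare Python scalars carry weak_type=True, arrays with an explicit dtype don’t, and a jitted function retraces when the flag flips even though shape and dtype match.</p>

```python
import jax.numpy as jnp

a = jnp.asarray(1.0)                  # from a Python float: weak_type=True
b = jnp.zeros((), dtype=jnp.float32)  # explicit dtype: weak_type=False

# Same shape and dtype, but different weak_type, so a jitted function
# traced on `a` will recompile when called with `b`.
print(a.dtype == b.dtype, a.aval.weak_type, b.aval.weak_type)  # True True False
```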

<h2 id="gymnasium-loop-earls-sebulba">Gymnasium Loop: Earl’s Sebulba</h2>

<p>Earl’s implementation of Sebulba is GymnasiumLoop. <a href="https://gymnasium.farama.org/index.html">Gymnasium</a> is a widely-used interface for RL environments, which generally are not compatible with jax.jit. The GymnasiumLoop has a much more complex design, was much trickier to implement correctly, and was harder to optimize for performance.</p>

<p>Here’s a diagram showing the system architecture. Sorry the text is a little small. You can open the image in a new window to zoom in.</p>

<p><img class="wrap" src="/generated/2025-03-03-earl/gymnasium-loop-800-ced76e552.png" alt="Gymnasium Loop system diagram" srcset="/generated/2025-03-03-earl/gymnasium-loop-400-bb341e4c4.webp 400w, /generated/2025-03-03-earl/gymnasium-loop-600-bb341e4c4.webp 600w, /generated/2025-03-03-earl/gymnasium-loop-800-bb341e4c4.webp 800w, /generated/2025-03-03-earl/gymnasium-loop-1000-bb341e4c4.webp 1000w" /></p>

<p>And here are results from a test showing linear scaling on up to 6 learner devices (TPU v2 cores):</p>

<p><img class="wrap" src="/generated/2025-03-03-earl/gymnasium_loop_scaling-563-9129f04da.png" alt="Gymnasium Loop scaling graph" srcset="/generated/2025-03-03-earl/gymnasium_loop_scaling-400-d0264f2f1.webp 400w, /generated/2025-03-03-earl/gymnasium_loop_scaling-563-d0264f2f1.webp 563w" /></p>

<p>Before going further into details, why is all this complexity needed in the first place? That is never explicitly addressed in the Podracers paper. I think the key thing is that in a naive loop of env.step(), agent.act(), the device (GPU) will be idle during env.step() and the CPU will be idle during agent.act(). So we can get much better throughput by double buffering: have one batch of actions being computed by the agent at the same time that a batch of observations is being computed by the environment. But to take advantage of this double buffering, the learner must be able to learn from batches of trajectories that come from different sets of environments and are delivered out of order. That basically implies an actor-learner split, and once the agent is split that way, you can get further throughput gains by scaling the number of actors and learners independently. And that’s the architecture: separate sets of actors and learners, communicating asynchronously, scaled independently.</p>

<p>OK, now some details.</p>

<h3 id="agent-state-organization">Agent state organization</h3>

<p>A key design challenge was creating a flexible agent architecture that could work efficiently in both Anakin and Sebulba paradigms. Earl has two main base classes: <a href="https://github.com/garymm/earl/blob/496737c4f3172caa151477d47280b3a172525138/earl/core.py#L57">AgentState</a> and <a href="https://github.com/garymm/earl/blob/496737c4f3172caa151477d47280b3a172525138/earl/core.py#L233">Agent</a>. These are structured to enable Sebulba-style training while attempting to leave the user lots of freedom (they’re also used for Anakin-style training but they’re way too complex if that’s the only thing you need).</p>

<p>AgentState has the following fields:</p>

<ul>
  <li>Actor. This is read and written by the actor. It is also read by the learner when calculating the loss. In agents that use recurrent networks, this includes the recurrent hidden states.</li>
  <li>Nets. This holds the neural networks. It is read by the actor, and read and written by the learner. Anything that needs a gradient computed needs to be in the networks.</li>
  <li>Opt. Anything other than nets that also needs to be updated when optimizing (i.e. updating the networks). This is where optimizer state belongs.</li>
  <li>Experience. This is state based on the trajectories accumulated by actors and sent to the learners. For agents that use experience replay, this contains replay buffers.</li>
</ul>
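<p>In sketch form (a simplified stand-in, not Earl’s actual class, which lives in core.py):</p>

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

Actor = TypeVar("Actor")
Nets = TypeVar("Nets")
Opt = TypeVar("Opt")
Experience = TypeVar("Experience")

@dataclass
class AgentState(Generic[Actor, Nets, Opt, Experience]):
    actor: Actor            # read/written by the actor; read by the learner's loss
    nets: Nets              # read by the actor; read and written by the learner
    opt: Opt                # optimizer state, updated alongside nets
    experience: Experience  # replay buffers etc., built from actor trajectories
```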

<p>And the key methods in Agent are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">act(self, actor_state: _ActorState, nets: _Networks, env_step: EnvStep) -&gt; ActionAndState[_ActorState]</code></li>
  <li><code class="language-plaintext highlighter-rouge">update_experience(self,  experience_state: _ExperienceState,  actor_state_pre: _ActorState,   actor_state_post: _ActorState,  trajectory: EnvStep) -&gt; _ExperienceState</code></li>
  <li><code class="language-plaintext highlighter-rouge">partition_for_grad(self, nets: _Networks) -&gt; tuple[_Networks, _Networks]</code></li>
  <li><code class="language-plaintext highlighter-rouge">loss(self, nets: _Networks, opt_state: _OptState, experience_state: _ExperienceState) -&gt; tuple[Scalar, _ExperienceState]</code></li>
  <li><code class="language-plaintext highlighter-rouge">optimize_from_grads(self, nets: _Networks, opt_state: _OptState, nets_grads: PyTree) -&gt; tuple[_Networks, _OptState]</code></li>
  <li><code class="language-plaintext highlighter-rouge">shard_actor_state(self, actor_state: _ActorState, learner_devices: Sequence[jax.Device]) -&gt; _ActorState</code></li>
</ul>

<p>The method signatures and AgentState structure force algorithms to be implemented such that GymnasiumLoop can run any agent in a scalable manner.</p>

<h3 id="implicit-double-buffering">Implicit double buffering</h3>

<p>When I first thought about double buffering I thought I would write code that used two CUDA streams to overlap work. I was surprised to learn that JAX does not expose CUDA streams or any similar abstraction. Upon re-reading the Podracers paper, I noticed they wrote:</p>

<p><em>To make efficient use of the actor cores, it is essential that while a Python thread is stepping a batch of environments, the corresponding TPU core is not idle. This is achieved by creating multiple Python threads per actor core, each with its own batched environment. The threads alternate in using the same actor core, without manual synchronization.</em></p>

<p>So I tried just having multiple threads use the same device, and lo and behold I got a huge speedup! Looking at a profile in Nvidia Nsight Systems revealed that under the hood, JAX had analyzed the computations coming in from the different threads, determined they were independent, and scheduled them on separate CUDA streams (really separate CUDA graphs). This is in contrast to PyTorch, which by default puts all work on a single stream and requires the user to specify another stream if desired.</p>
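<p>The pattern is easy to reproduce in miniature (a sketch, not Earl’s loop code): two Python threads each dispatch work to the same jitted function on the shared device, with no manual synchronization, leaving JAX free to overlap their execution.</p>

```python
import threading

import jax
import jax.numpy as jnp
import numpy as np

@jax.jit
def act(params, obs):
    # Stand-in for agent.act: one batched forward pass.
    return jnp.tanh(obs @ params)

def actor_thread(params, out, idx, seed):
    rng = np.random.default_rng(seed)
    # Each thread steps its own batched "environment" (simulated here)
    # and dispatches act() on the shared device.
    obs = rng.standard_normal((64, 32)).astype(np.float32)
    out[idx] = np.asarray(act(params, obs))

params = np.zeros((32, 8), dtype=np.float32)
out = [None, None]
threads = [threading.Thread(target=actor_thread, args=(params, out, i, i))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```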

<p>Below you can see the Nsight Systems UI showing the different CUDA graphs at the top and the agent.act() overlapping with the env.step() at the bottom.
The two graphs of interest are 12 and 15. The profile records which thread launched each graph, which confirmed that the two threads were launching separate graphs.</p>

<p><img class="wrap" src="/generated/2025-03-03-earl/nsight-two-threads-800-a6b5fc19f.png" alt="Nsight Systems profile of two actor threads" srcset="/generated/2025-03-03-earl/nsight-two-threads-400-e1ef4ba4c.webp 400w, /generated/2025-03-03-earl/nsight-two-threads-600-e1ef4ba4c.webp 600w, /generated/2025-03-03-earl/nsight-two-threads-800-e1ef4ba4c.webp 800w, /generated/2025-03-03-earl/nsight-two-threads-1000-e1ef4ba4c.webp 1000w" /></p>

<h3 id="batching-and-sharding-data">Batching and sharding data</h3>

<p>The paper suggests that experience data is copied from the actors to the learners one batch at a time. This seems quite inefficient. I instead break up acting into cycles of configurable length, and copy one cycle’s worth of batches at a time from the actor to the learners (i.e. num_envs * steps_per_cycle units of observations, actions, rewards, etc).</p>

<p>The paper does not address the details of how the data is stored and retrieved for replay. In Earl, the user specifies num_envs, which for GymnasiumLoop is the number of environments per actor thread. There are two actor threads per actor device. Each actor thread shards the trajectory and actor state evenly across the learner devices. Thus when the framework calls Agent.update_experience() on the learner device, the experience data has batch size = num_envs / len(learner_devices), which must be an integer (i.e. must divide evenly). The Agent is free to store and replay that experience in whatever way it chooses. For my R2D2 implementation, to keep things simple, I store the experience using that same batch size (num_envs / len(learner_devices)) and then replay some batch size that is an integer multiple of that batch size.</p>
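<p>The sharding arithmetic in sketch form (the shapes are illustrative, and this is plain NumPy rather than the actual jax.device_put_sharded call):</p>

```python
import numpy as np

num_envs, steps_per_cycle, num_learner_devices = 8, 16, 2
assert num_envs % num_learner_devices == 0  # must divide evenly

# One cycle's worth of observations for one actor thread
# (84x84 is an assumed observation shape).
obs = np.zeros((steps_per_cycle, num_envs, 84, 84))
shards = np.split(obs, num_learner_devices, axis=1)

# Each learner device sees batch size num_envs // num_learner_devices.
print(len(shards), shards[0].shape)  # 2 (16, 4, 84, 84)
```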

<p>One thing not mentioned in the paper but which is obviously necessary for many algorithms is copying of actor state to the learners. For example, in R2D2 the LSTM hidden states at the beginning of a trajectory are needed by the learners. The framework can take care of properly distributing the observations, actions and rewards, but the details of the actor state depend on the particular agent implementation, so users of GymnasiumLoop have to implement Agent.shard_actor_state(actor_state, learner_devices). Depending on the algorithm, some elements of the state will be sharded evenly to go along with the trajectory data, while other elements will be replicated across all learner devices or not copied at all.</p>

<h3 id="performance-tuning">Performance tuning</h3>

<p>In GymnasiumLoop, ideally all accelerator devices are being fully utilized. Getting there requires a lot of tuning. Some of the knobs available for tuning and what they do:</p>

<ul>
  <li>Num_envs: the number of environments per actor thread (there are 2 actor threads per actor device). Increasing this will increase CPU usage during env.step() and increase actor device (e.g. GPU) usage during agent.act(). It will increase CPU memory usage (for the environment state). It will also increase memory usage on the actor device, more so if the actor maintains per-environment state (e.g. recurrent hidden state).</li>
  <li>Num_off_policy_optims_per_cycle: the number of times Agent.loss and Agent.optimize_from_grads is called between waiting for new experience data from the actors. Increasing this will increase learner device usage. It may cause the actor threads to block (and thus make the actor devices and CPUs idle) if the queue for experience data is full (currently the queue has a max length of 2). Increasing it will also make the algorithm more off-policy, since it does more updates on experience that was produced by older policies.</li>
  <li>The number of actor devices and learner devices. More learner devices effectively increases batch sizes and thus can help training be faster or more stable. More actor devices increases the rate at which new experience trajectories are made available to the learners. If the number of environments on a machine is limited by CPU cores or CPU memory, increasing the number of actor devices effectively reduces the actor batch size (num_envs).</li>
</ul>

<p>The metrics that are currently exposed on every run are the cycle time for the learners (which includes getting new experience and then some number of loss + optimization steps), and the time the learners spend waiting for an actor to enqueue experience. Because JAX arrays are materialized asynchronously, the actor thread’s call to jax.device_put_sharded() will return before the data has actually been copied to the learner devices. Thus the learner device will be able to successfully retrieve experience from the queue, but computation may block waiting for the data to be copied. I don’t think there’s a good way to expose the exact amount of time spent waiting for copies during normal execution (doing so would require putting in barriers that could hurt performance). So the process I used for tuning performance was:</p>

<ol>
  <li>If learner device utilization is not high, try tweaking the above knobs to get it up.</li>
  <li>When that didn’t succeed, use a profiler (I used Nvidia Nsight Systems). This made it fairly easy to see when computation was waiting on copies.</li>
</ol>

<h3 id="performance-footgun-implicit-vs-explicit-host-device-copies">Performance footgun: implicit vs explicit host-&gt;device copies</h3>

<p>Using the profiler I was able to spot a blocking host-&gt;device copy in the inner loop of the actor cycle that was caused by something like:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">steps_per_cycle</span><span class="p">:</span>
  <span class="n">observation</span><span class="p">,</span> <span class="n">done</span><span class="p">,</span> <span class="n">reward</span> <span class="o">=</span> <span class="n">env</span><span class="p">.</span><span class="nf">step</span><span class="p">(</span><span class="n">action</span><span class="p">)</span>
  <span class="n">observation</span><span class="p">,</span> <span class="n">done</span><span class="p">,</span> <span class="n">reward</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">numpy</span><span class="p">.</span><span class="nf">array</span><span class="p">(</span><span class="n">observation</span><span class="p">),</span> <span class="n">jax</span><span class="p">.</span><span class="n">numpy</span><span class="p">.</span><span class="nf">array</span><span class="p">(</span><span class="n">done</span><span class="p">),</span> <span class="n">jax</span><span class="p">.</span><span class="n">numpy</span><span class="p">.</span><span class="nf">array</span><span class="p">(</span><span class="n">reward</span><span class="p">)</span>
  <span class="n">action</span> <span class="o">=</span> <span class="n">agent</span><span class="p">.</span><span class="nf">act</span><span class="p">(</span><span class="n">observation</span><span class="p">,</span> <span class="n">done</span><span class="p">,</span> <span class="n">reward</span><span class="p">)</span>
</code></pre></div></div>

<p>It turned out that the explicit conversion from Numpy to JAX arrays was much much slower than just passing the Numpy arrays directly into Agent.act. I confirmed the issue with this simplified example:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="n">jax</span>

<span class="nd">@jax.jit</span>
<span class="k">def</span> <span class="nf">add</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
 <span class="k">return</span> <span class="n">a</span><span class="o">+</span><span class="n">b</span><span class="o">+</span><span class="mi">1</span>

<span class="k">def</span> <span class="nf">lazy</span><span class="p">():</span>
 <span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">ones</span><span class="p">((</span><span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">))</span>
 <span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">ones</span><span class="p">((</span><span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">))</span> <span class="o">*</span> <span class="mi">2</span>
 <span class="k">return</span> <span class="nf">add</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">eager</span><span class="p">():</span>
 <span class="n">a</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">numpy</span><span class="p">.</span><span class="nf">array</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nf">ones</span><span class="p">((</span><span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">)))</span>
 <span class="n">b</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">numpy</span><span class="p">.</span><span class="nf">array</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nf">ones</span><span class="p">((</span><span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">))</span> <span class="o">*</span> <span class="mi">2</span><span class="p">)</span>
 <span class="k">return</span> <span class="nf">add</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
</code></pre></div></div>

<p>The lazy function takes 0.3 milliseconds and the eager takes 1.1 (3.7x slowdown) on a Google Colab instance with a T4 GPU.</p>

<p>Under the hood, the eager function launches 3 CUDA kernels, one for each array copy and one for the addition, returning to the Python interpreter between each. The lazy function goes into CUDA only once.</p>

<h3 id="batching-gymnasium-environments">Batching Gymnasium environments</h3>

<p>In the Podracers paper’s section on Sebulba, they write:
<em>To minimise the effect of Python’s GIL, when stepping a batch of environments in parallel, each Python actor-thread interacts with a special batched environment; this is exposed to Python as a single environment that takes a batch of actions and returns a batch of observations; behind the scenes it steps each environment in the batch in parallel using a shared pool of C++ threads.</em></p>

<p>The functionality described in the paper is provided for some environments by <a href="https://envpool.readthedocs.io/en/latest/">EnvPool</a>. For Gymnasium environments not supported by EnvPool, Earl applies Gymnasium’s built-in vectorization, which uses Python multiprocessing to run multiple copies of the environment in parallel. This is much, much slower than EnvPool, and one fun problem was that each subprocess would try to pre-allocate most of the GPU memory on startup (this happens whenever you import JAX). I worked around this by setting an environment variable telling JAX to use only the CPU in those environment subprocesses.</p>
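<p>A minimal sketch of that workaround (the helper name is mine, and Earl’s actual code may differ; the key is setting <code>JAX_PLATFORMS=cpu</code> before the subprocess imports JAX):</p>

```python
import os

def cpu_only_env(base_env=None):
    """Build the environment for an env-stepping subprocess.

    Setting JAX_PLATFORMS=cpu before `import jax` runs keeps the
    subprocess from initializing the GPU backend and pre-allocating
    most of the GPU memory on startup.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["JAX_PLATFORMS"] = "cpu"
    return env

# e.g. subprocess.Popen(worker_cmd, env=cpu_only_env())
```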

<h2 id="potential-improvements">Potential improvements</h2>

<h3 id="pmap---automatic-parallelism">Pmap -&gt; automatic parallelism</h3>

<p>When the Podracers paper was written, jax.pmap was the recommended way of parallelizing computation across multiple devices. Since then, the JAX team has developed “<a href="https://docs.jax.dev/en/latest/notebooks/Distributed_arrays_and_automatic_parallelization.html">automatic parallelism</a>” and encourages its use over pmap. The basic idea is that the programmer shards (or replicates, which in JAX is called a type of sharding) arrays across devices, and the compiler and runtime automatically figure out where computation should happen and where function outputs should go.</p>

<p>I prototyped an implementation of GymnaxLoop that used automatic parallelism before throwing it away and settling on the explicit pmap approach. The reason is that I couldn’t convince myself that sampling randomly from a replay buffer wouldn’t result in extra cross-device copies and uneven workloads. Earl is currently entirely agnostic to how an agent manages its experience state (which will include the replay buffers). Experience replay could be implemented in a way that is compatible with automatic parallelism (I believe the main constraints are that the buffer has to be sized such that it can be sharded evenly across devices, and that reads and writes are balanced across all devices), but guaranteeing this would require the framework to be more opinionated about how replay buffers are managed.</p>

<p>If I were to do this, I would look to <a href="https://dm-acme.readthedocs.io/en/latest/">DeepMind’s Acme</a> for inspiration. It is extremely prescriptive about how experience state is managed, and I think a similar design could result in something that’s guaranteed to be performant with JAX’s automatic parallelism.</p>

<h3 id="multiple-losses">Multiple losses</h3>

<p>Some algorithms compute different loss terms for different subsystems. Earl doesn’t currently support this, but it wouldn’t be too hard to add.</p>

<h2 id="scaling-to-multiple-machines-or-how-special-are-the-pods-really">Scaling to multiple machines, or how special are the “pods” really?</h2>

<p>Earl currently only supports single-machine training. Supporting multi-machine would be as straightforward as adding a call to jax.distributed.initialize() in the training script. However, when scaling to multiple machines, network bandwidth becomes a critical factor. Let’s analyze how bandwidth affects training throughput and compare TPU pods with modern GPU clusters.</p>

<p>The “pod” in the “Podracers” article is a reference to a Google Cloud TPU pod, a group of TPU chips with high-bandwidth interconnects. Both Anakin and Sebulba have to send gradients between all learner devices before every optimizer step, and this latency cannot easily be hidden by overlapping it with other work (unlike the transfers from actors to learners, which can be overlapped with both acting and learning). The amount of data that needs to be averaged is: (bits per gradient) x (num parameters).</p>

<p>Let’s say each device has bandwidth of R bits / sec and the gradients take S bits. Assuming the mean is calculated and sent back using a reduce-scatter and then all-gather, the time taken is:</p>

<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mfrac><mrow><mn>2</mn><mi>S</mi></mrow><mi>R</mi></mfrac></mrow><annotation encoding="application/x-tex">\frac{2S}{R}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:2.0463em;vertical-align:-0.686em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.3603em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.00773em;">R</span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">2</span><span class="mord mathnormal" style="margin-right:0.05764em;">S</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.686em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span></span>

<p>Now let’s try to get some sensible values of R and S.
Most online RL research uses relatively few parameters compared to modern LLMs (e.g. Dreamer v3 XL has 300 million parameters, the unusually large Gato has 1.2 billion).
To work an example, let’s say we use 16 bits per gradient x 1 billion parameters = 16 Gbits. The TPU v6e has R = 3584 Gbps of inter-chip interconnect bandwidth, which gets us:</p>

<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mfrac><mrow><mn>2</mn><mo>×</mo><mn>16</mn></mrow><mn>3584</mn></mfrac><mo>=</mo><mn>0.009</mn><mi>s</mi></mrow><annotation encoding="application/x-tex">\frac{2 \times 16}{3584} = 0.009s</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:2.0074em;vertical-align:-0.686em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.3214em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">3584</span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">2</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord">16</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.686em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">0.009</span><span class="mord mathnormal">s</span></span></span></span></span>

<p>To answer how special TPU pods are, let’s compare this to Nvidia GPUs. Nvidia’s GB200 can connect up to 72 GPUs at 1800 Gbps. The same reduction would take</p>

<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mfrac><mrow><mn>2</mn><mo>×</mo><mn>16</mn></mrow><mn>1800</mn></mfrac><mo>=</mo><mn>0.018</mn><mi>s</mi></mrow><annotation encoding="application/x-tex">\frac{2 \times 16}{1800} = 0.018s</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:2.0074em;vertical-align:-0.686em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.3214em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">1800</span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">2</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mord">16</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.686em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">0.018</span><span class="mord mathnormal">s</span></span></span></span></span>

<p>Roughly twice as long, but to determine how much of an impact this makes on training throughput we’d need to look at a particular example, which depends heavily on hyperparameters. TPU networking still appears to have higher bandwidth, but for workloads that fit within an NVLink switch, the impact on training throughput may be quite small.</p>
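<p>The arithmetic above is easy to play with. A tiny calculator using the same formula and the bandwidth figures quoted above:</p>

```python
def allreduce_seconds(grad_gbits, link_gbps):
    """Time for a reduce-scatter followed by an all-gather: 2S/R."""
    return 2 * grad_gbits / link_gbps

GRAD_GBITS = 16  # 16-bit gradients x 1 billion parameters

tpu_v6e = allreduce_seconds(GRAD_GBITS, 3584)  # ~0.009 s
gb200 = allreduce_seconds(GRAD_GBITS, 1800)    # ~0.018 s
```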

<h2 id="conclusion">Conclusion</h2>

<p>Years ago only DeepMind and OpenAI could do distributed RL at scale. Today, thanks to the libraries, APIs, on-demand cloud computing, and knowledge that is available, it’s within reach of a very small team (like me!).</p>

<h2 id="acknowledgements">Acknowledgements</h2>

<p>I started Earl while working at the Astera Institute, though I didn’t implement distributed training until after I left.
I thank Jed McCaleb for agreeing to let me open-source it.
My coworkers at Astera contributed to Earl early on: Andrew Grebenisan, Mick van Gelderen and Eric Alt.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>I use these terms for consistency with the paper, which in no way should be read as my endorsement of The Phantom Menace. Though I did enjoy the <a href="https://en.wikipedia.org/wiki/Star_Wars_Episode_I:_Racer">Racer</a> game. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>garymm</name></author><category term="programming" /><category term="machine learning" /><summary type="html"><![CDATA[In this post I will briefly describe Earl, a reinforcement learning (RL) framework I wrote that enables scalable distributed training across multiple devices, and discuss some of the things I learned along the way.]]></summary></entry><entry><title type="html">Starflate: Deflate decompression in C++23</title><link href="http://garymm.org/blog/2025/01/31/starflate/" rel="alternate" type="text/html" title="Starflate: Deflate decompression in C++23" /><published>2025-01-31T00:00:00-08:00</published><updated>2025-01-31T00:00:00-08:00</updated><id>http://garymm.org/blog/2025/01/31/starflate</id><content type="html" xml:base="http://garymm.org/blog/2025/01/31/starflate/"><![CDATA[<p>In this post I describe some things I learned while working on <a href="https://github.com/garymm/starflate">Starflate</a>, an implementation of Deflate decompression in C++23 that I wrote with my friend <a href="https://github.com/oliverlee">Oliver Lee</a>.</p>

<p>Deflate is a compression codec used in GZip, Zip, PNG and other formats. I wanted to get hands-on with GPU programming and decided implementing Deflate decompression would be a fun way to do that. After finishing the CPU-only implementation, I realized there is no way to efficiently parallelize it, so I’ll have to find another project for learning GPU programming. But along the way I did learn quite a bit about compression and C++.</p>

<h2 id="deflate-decompression">Deflate decompression</h2>

<p>I think this diagram does a pretty good job of showing the different layers in the Deflate compression algorithm:</p>

<p><img class="wrap" src="/generated/2025-01-31-deflate-800-0c3ca2880.png" alt="deflate compression layers" srcset="/generated/2025-01-31-deflate-400-635bef326.webp 400w, /generated/2025-01-31-deflate-600-635bef326.webp 600w, /generated/2025-01-31-deflate-800-635bef326.webp 800w, /generated/2025-01-31-deflate-1000-635bef326.webp 1000w" /></p>

<p>Figure 4 from <a href="https://doi.org/10.1002/cpe.7454">Takafuji et al., 2022</a></p>

<p>The innermost layer is LZSS, in which the input is a series of either:</p>

<ul>
  <li>A “literal”, meaning just copy this byte to the output, or</li>
  <li>A length and backwards-distance pair (l, d),  meaning copy l bytes starting from output[-d] to output.</li>
</ul>
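<p>A runnable Python sketch of this inner layer (illustrative only, not Starflate’s API). Note that a back-reference may overlap its own output, which is why the copy proceeds byte-at-a-time:</p>

```python
def lzss_decode(tokens):
    """Decode a sequence of LZSS tokens.

    Each token is either a literal byte (an int), or a
    (length, distance) pair meaning "copy `length` bytes starting
    `distance` bytes back in the output".
    """
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, int):      # literal: copy the byte as-is
            out.append(tok)
        else:                         # (length, distance) back-reference
            length, dist = tok
            for _ in range(length):   # byte-at-a-time: copies may overlap
                out.append(out[-dist])
    return bytes(out)

# "abcabcabc" as three literals plus one overlapping back-reference:
lzss_decode(list(b"abc") + [(6, 3)])  # b"abcabcabc"
```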

<p>The second layer is an encoding scheme for the length and distance pairs that doesn’t seem to have a name, but is shown in the diagram as “deflate” format. The deflate standard defines a code table for distances and another for lengths. The steps to decode go something like:</p>

<ol>
  <li>Look up the code in the table. This gives a base value and a number of extra bits to read from the input.</li>
  <li>Read those extra bits from the input, interpret them as an integer.</li>
  <li>Add the integer to the base value.</li>
</ol>
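<p>For example, with a small fragment of the deflate length table (base values and extra-bit counts as given in RFC 1951, section 3.2.5; <code>read_bits</code> stands in for pulling bits off the input stream):</p>

```python
# code -> (base_length, number_of_extra_bits), per RFC 1951 sec. 3.2.5
LENGTH_CODES = {265: (11, 1), 266: (13, 1), 269: (19, 2), 270: (23, 2)}

def decode_length(code, read_bits):
    base, n_extra = LENGTH_CODES[code]
    return base + read_bits(n_extra)  # extra bits, interpreted as an integer

decode_length(266, lambda n: 1)  # base 13 + offset 1 = length 14
```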

<p>The outermost layer is Huffman coding. I won’t do a better job than <a href="https://en.wikipedia.org/wiki/Huffman_coding">Wikipedia</a> of explaining it, but basically it’s a provably optimal (as in maximally compact) prefix-free coding scheme (meaning the code for any symbol is not a prefix of the code for any other symbol).</p>

<p>Finally, there is the added complexity that the Huffman code tables themselves can be included in the compressed data, and they are encoded using a scheme similar to the second-layer (“deflate” coding) scheme (but slightly different).</p>

<h2 id="starflate-design">Starflate design</h2>

<p>The <a href="https://github.com/garymm/starflate/blob/289b78afa5aa93f0971fcee9f5d17d3bf0a93dd2/src/decompress.cpp">core implementation of decompression</a> is 391 lines of code (excluding comments and blank lines), and I think it’s relatively readable. However, there are another ~1300 lines of code in helper libraries we wrote for dealing with bit streams and Huffman coding. These helper libraries allowed the main code to stay quite short and readable.</p>

<h3 id="bit_span">bit_span</h3>

<p><a href="https://github.com/garymm/starflate/blob/289b78afa5aa93f0971fcee9f5d17d3bf0a93dd2/huffman/src/bit_span.hpp">bit_span</a> is like std::span in that it is a non-owning view of a contiguous extent of the same type of data. Unlike span, bit_span allows its users to iterate over individual bits, even though the underlying data is stored as bytes.</p>
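<p>In Python terms, it behaves something like this generator (a sketch, not the C++ interface; deflate packs bits least-significant-bit-first within each byte):</p>

```python
def iter_bits(data):
    """Yield the bits of `data` one at a time, LSB-first within each byte."""
    for byte in data:
        for i in range(8):
            yield (byte >> i) & 1

list(iter_bits(b"\x05"))  # [1, 0, 1, 0, 0, 0, 0, 0]
```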

<h3 id="huffmantable">huffman::table</h3>

<p><a href="https://github.com/garymm/starflate/blob/289b78afa5aa93f0971fcee9f5d17d3bf0a93dd2/huffman/src/table.hpp">huffman::table</a> is a Huffman code table. For ease of testing it has a bunch of different constructors, but the only one used in decompression is the one that takes a range of pairs of (symbol range, bitsize). Huffman coding uses prefix-free codes, meaning that we decode the input one bit at a time and we’re done as soon as we find the bit pattern in the table. Internally the table stores things sorted lexicographically, which allows for efficient decoding by keeping track of where we are in the table in between attempts to decode the bits. In an attempt to be idiomatic C++, the table exposes iterators with the standard begin() and end() methods. The main use of the table class is in <a href="https://github.com/garymm/starflate/blob/289b78afa5aa93f0971fcee9f5d17d3bf0a93dd2/huffman/src/decode.hpp#L74">decode_one</a>, the pseudo-code for which is:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">decode_one</span><span class="p">(</span><span class="n">huffman_table</span><span class="p">,</span> <span class="n">bits</span><span class="p">):</span>
   <span class="n">table_pos</span> <span class="o">=</span> <span class="n">huffman_table</span><span class="p">.</span><span class="nf">begin</span><span class="p">()</span> <span class="c1"># iterator
</span>   <span class="n">current_code</span> <span class="o">=</span> <span class="mi">0</span>
   <span class="k">for</span> <span class="n">bit</span> <span class="ow">in</span> <span class="n">bits</span><span class="p">:</span>
      <span class="n">current_code</span> <span class="o">=</span> <span class="p">(</span><span class="n">current_code</span> <span class="o">&lt;&lt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">|</span> <span class="n">bit</span>
      <span class="n">std</span><span class="p">::</span><span class="n">expected</span><span class="o">&lt;</span><span class="n">table</span><span class="p">::</span><span class="n">iterator</span><span class="p">,</span> <span class="n">table</span><span class="p">::</span><span class="n">iterator</span><span class="o">&gt;</span> <span class="n">found</span> <span class="o">=</span> <span class="n">huffman_table</span><span class="p">.</span><span class="nf">find</span><span class="p">(</span>
        <span class="n">current_code</span><span class="p">,</span> <span class="n">table_pos</span><span class="p">)</span>
      <span class="k">if</span> <span class="n">found</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">found</span><span class="o">-&gt;</span><span class="n">symbol</span>
      <span class="n">table_pos</span> <span class="o">=</span> <span class="n">found</span><span class="p">.</span><span class="nf">error</span><span class="p">()</span> <span class="c1"># uses expected::error to hold the next iterator position
</span>      <span class="k">if</span> <span class="n">table_pos</span> <span class="o">==</span> <span class="n">huffman_table</span><span class="p">.</span><span class="nf">end</span><span class="p">():</span>
        <span class="k">return</span> <span class="nf">error</span><span class="p">()</span>
  <span class="k">return</span> <span class="nf">error</span><span class="p">()</span>
</code></pre></div></div>
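<p>A runnable Python version of the same loop, using a toy prefix-free table in place of <code>huffman::table</code> (the real table is sorted and searched incrementally via iterators, but the bit-by-bit prefix matching is the same):</p>

```python
def decode_one_py(table, bits):
    """Decode one symbol: accumulate bits until they match a code.

    `table` maps (code, bitsize) -> symbol. Because the code is
    prefix-free, the first match is unambiguous.
    """
    code, n = 0, 0
    for bit in bits:
        code = (code << 1) | bit
        n += 1
        if (code, n) in table:
            return table[(code, n)]
    raise ValueError("input ended mid-code")

# Toy table: 0 -> 'a', 10 -> 'b', 11 -> 'c'
toy = {(0b0, 1): "a", (0b10, 2): "b", (0b11, 2): "c"}
decode_one_py(toy, [1, 0])  # 'b'
```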

<h2 id="c-features--patterns">C++ features / patterns</h2>

<p>A few C++ features or patterns I learned along the way. Thanks to Oliver for teaching me all these (and more that didn’t stick!).</p>

<h3 id="constexpr">constexpr</h3>

<p>My biggest C++ lesson learned is that a large fraction of the language and library features are compatible with constexpr, meaning they can be evaluated at compile time. While there are potential runtime performance benefits to this, it’s also cool that the compiler must reject undefined behavior in constant expressions, so constant-evaluated code is guaranteed to contain none. That is, this is a way to convert potential runtime errors into compile-time errors. This is the one feature of C++ that I actually missed when writing Rust recently.</p>

<h3 id="stdexpected">std::expected</h3>

<p>Added in C++23, std::expected contains either an expected or an error value. It’s a sane way of propagating errors and we used it extensively. This is one of those things that I didn’t notice was missing from the language when I worked at Google because Google had its own version. Actually the standard library version is better because both the expected and error types can be templated, which we took advantage of for huffman::table::find’s return type.</p>

<h3 id="the-overload-pattern-pattern-matching">The overload pattern: pattern matching</h3>

<p>You can combine std::variant, std::visit, and the overload pattern to get something like Rust’s pattern matching.
We used this <a href="https://github.com/garymm/starflate/blob/289b78afa5aa93f0971fcee9f5d17d3bf0a93dd2/src/decompress.cpp#L227">here</a> to dispatch to different code paths depending on whether we decoded a literal byte to be copied to the output, or a length of previous output to be copied. The syntax for it is terrible, though. This example from <a href="https://www.cppstories.com/2019/02/2lines3featuresoverload.html/">C++ Stories</a> is a good one:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span><span class="o">&lt;</span><span class="k">class</span><span class="o">...</span> <span class="nc">Ts</span><span class="p">&gt;</span> <span class="k">struct</span> <span class="nc">overload</span> <span class="o">:</span> <span class="n">Ts</span><span class="p">...</span> <span class="p">{</span> <span class="k">using</span> <span class="n">Ts</span><span class="o">::</span><span class="k">operator</span><span class="p">()...;</span> <span class="p">};</span>

<span class="n">std</span><span class="o">::</span><span class="n">variant</span><span class="o">&lt;</span><span class="kt">int</span><span class="p">,</span> <span class="kt">float</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&gt;</span> <span class="n">intFloatString</span> <span class="p">{</span> <span class="s">"Hello"</span> <span class="p">};</span>
<span class="n">std</span><span class="o">::</span><span class="n">visit</span><span class="p">(</span><span class="n">overload</span>  <span class="p">{</span>
      <span class="p">[](</span><span class="k">const</span> <span class="kt">int</span><span class="o">&amp;</span> <span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"int: "</span> <span class="o">&lt;&lt;</span> <span class="n">i</span><span class="p">;</span> <span class="p">},</span>
      <span class="p">[](</span><span class="k">const</span> <span class="kt">float</span><span class="o">&amp;</span> <span class="n">f</span><span class="p">)</span> <span class="p">{</span> <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"float: "</span> <span class="o">&lt;&lt;</span> <span class="n">f</span><span class="p">;</span> <span class="p">},</span>
      <span class="p">[](</span><span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&amp;</span> <span class="n">s</span><span class="p">)</span> <span class="p">{</span> <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"string: "</span> <span class="o">&lt;&lt;</span> <span class="n">s</span><span class="p">;</span> <span class="p">}</span>
    <span class="p">},</span>
    <span class="n">intFloatString</span>
<span class="p">);</span>
</code></pre></div></div>

<h3 id="template-deduction-guide">Template deduction guide</h3>

<p>A template deduction guide is some code that one can add to a templated function or class that tells the compiler how to fill in template arguments. This can make using the templated function or class much more readable.</p>

<p>This example from <a href="https://en.cppreference.com/w/cpp/language/class_template_argument_deduction#User-defined_deduction_guides">cppreference</a> is a good one:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// declaration of the template</span>
<span class="k">template</span><span class="o">&lt;</span><span class="k">class</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="k">struct</span> <span class="nc">container</span>
<span class="p">{</span>
    <span class="n">container</span><span class="p">(</span><span class="n">T</span> <span class="n">t</span><span class="p">)</span> <span class="p">{}</span>

    <span class="k">template</span><span class="o">&lt;</span><span class="k">class</span> <span class="nc">Iter</span><span class="p">&gt;</span>
    <span class="n">container</span><span class="p">(</span><span class="n">Iter</span> <span class="n">beg</span><span class="p">,</span> <span class="n">Iter</span> <span class="n">end</span><span class="p">);</span>
<span class="p">};</span>

<span class="c1">// additional deduction guide</span>
<span class="k">template</span><span class="o">&lt;</span><span class="k">class</span> <span class="nc">Iter</span><span class="p">&gt;</span>
<span class="n">container</span><span class="p">(</span><span class="n">Iter</span> <span class="n">b</span><span class="p">,</span> <span class="n">Iter</span> <span class="n">e</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">container</span><span class="o">&lt;</span><span class="k">typename</span> <span class="n">std</span><span class="o">::</span><span class="n">iterator_traits</span><span class="o">&lt;</span><span class="n">Iter</span><span class="o">&gt;::</span><span class="n">value_type</span><span class="o">&gt;</span><span class="p">;</span>

<span class="c1">// uses</span>
<span class="n">container</span> <span class="nf">c</span><span class="p">(</span><span class="mi">7</span><span class="p">);</span> <span class="c1">// OK: deduces T=int using an implicitly-generated guide</span>
<span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">double</span><span class="o">&gt;</span> <span class="n">v</span> <span class="o">=</span> <span class="p">{</span><span class="cm">/* ... */</span><span class="p">};</span>
<span class="k">auto</span> <span class="n">d</span> <span class="o">=</span> <span class="n">container</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span> <span class="n">v</span><span class="p">.</span><span class="n">end</span><span class="p">());</span> <span class="c1">// OK: deduces T=double</span>
</code></pre></div></div>

<h3 id="missing-feature-iterator_interface">Missing feature: iterator_interface</h3>

<p>A lot of the standard library exposes and operates on iterators, so it’s nice to also do so when writing custom data structures. Quoting the <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2727r4.html">proposal for adding std::iterator_interface</a>: “Writing STL iterators is surprisingly hard. There are a lot of things that can subtly go wrong. It is also very tedious, which of course makes it error-prone.” We needed a couple of iterators (one for bit_span and one for huffman::table), so we added a simple version of the <a href="https://github.com/garymm/starflate/blob/289b78afa5aa93f0971fcee9f5d17d3bf0a93dd2/huffman/src/detail/iterator_interface.hpp">iterator_interface</a>. There is an implementation in <a href="https://www.boost.org/doc/libs/1_87_0/doc/html/boost_stlinterfaces/tutorial___iterator_interface_.html">Boost</a>, but we wanted to avoid any external dependencies.</p>

<h2 id="setting-up-c-is-horrible">Setting up C++ is horrible</h2>

<p>C++ takes an insane amount of set-up to get a repository with features that are extremely easy to get in other languages. We set up the following, and none of it is really standard or easy to do:</p>

<ul>
  <li>Build system. One has to choose between Make, CMake, Meson, Bazel, etc. We chose Bazel because it’s good and we’re used to it, but it takes a lot of work to set up, it’s poorly documented, and less common combinations of features (like C++ test coverage with Clang) have been broken.</li>
  <li>Hermetic toolchain, meaning inside the repo we define what versions of Clang, GCC, etc we want to use, rather than relying on whatever is installed on the system.</li>
  <li>Sanitizers. E.g. thread sanitizer, address sanitizer, undefined behavior sanitizer. These are compilation modes that instrument the code and fail if the code does something bad. Address sanitizer and undefined behavior sanitizer aren’t needed for most other languages, but I think it’s pretty insane to write C++ without them.</li>
  <li>Static analysis (AKA linting). Basically turn on all the compiler warnings and treat them as errors (pretty insane that this is not the default for most of the warnings), and also run clang-tidy. Running clang-tidy through Bazel is not straightforward, but Oliver figured it out.</li>
  <li>Autoformatting. Again insane that this is not the default, and one needs to do extra work to get it configured in editors and enforced in CI.</li>
  <li>Bringing in third party dependencies is horrible, and you need at least some third party dependencies because the standard library doesn’t include a unit test library.</li>
</ul>]]></content><author><name>garymm</name></author><category term="programming" /><category term="cpp" /><summary type="html"><![CDATA[In this post I describe some things I learned while working on Starflate, an implementation of Deflate decompression in C++23 that I wrote with my friend Oliver Lee.]]></summary></entry><entry><title type="html">Assembling an infrastructure for machine learning research</title><link href="http://garymm.org/blog/2025/01/27/assembling-ml-exp-infra/" rel="alternate" type="text/html" title="Assembling an infrastructure for machine learning research" /><published>2025-01-27T00:00:00-08:00</published><updated>2025-01-27T00:00:00-08:00</updated><id>http://garymm.org/blog/2025/01/27/assembling-ml-exp-infra</id><content type="html" xml:base="http://garymm.org/blog/2025/01/27/assembling-ml-exp-infra/"><![CDATA[<p>While working on machine learning research at the Astera Institute<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>, I led a team that assembled a system that enabled researchers to quickly and easily run experiments that used up to a full datacenter’s worth of GPUs. I intentionally wrote “assemble” rather than “build”, because the system mostly consists of off-the-shelf components. The challenge was in digging through the huge number of options for each possible piece of functionality, selecting appropriately, gluing things together into a working system, and designing an easy but powerful interface. I’m proud of how little code we wrote relative to how much functionality the system provides.</p>

<p>Let’s start with the core functionality that the system provided:</p>

<ul>
  <li>Takes code from the user’s git branch.</li>
  <li>Runs the code on a cluster with a specifiable number of trials running in parallel, and with each trial using a specifiable number of GPUs.</li>
  <li>Executes a hyperparameter search using a specifiable search space and algorithm, or multiple trials with different random seeds to get a distribution of results for a fixed set of hyperparameters.</li>
  <li>Persists the results and makes them viewable in a web UI, or via CSV files.</li>
</ul>

<p>Now let’s go into how this is all implemented!</p>

<h2 id="the-components">The components</h2>

<h3 id="hardware">Hardware</h3>

<p>We rented hardware from Voltage Park, who at the time offered exactly one thing: bare-metal servers running Ubuntu with 8 Nvidia GPUs. Because of pre-existing contracts, we didn’t have any choice in hardware or cloud providers. This constrained the design space for some of the other parts of the system.</p>

<h3 id="kubernetes-on-talos-hardware-abstraction-and-workload-orchestration">Kubernetes on Talos: Hardware abstraction and workload orchestration</h3>

<p>Just as an operating system abstracts over CPUs and RAM on a computer and manages the life cycles of processes, Kubernetes abstracts over all the hardware in a cluster and manages the life cycles of containers.</p>

<p>We ended up using <a href="https://www.talos.dev">Talos</a>, a Linux distribution that includes Kubernetes. Overall we were really happy with that choice. It’s well-designed, well-documented, and well-supported.</p>

<h4 id="the-journey">The journey</h4>

<p>While there are alternatives, Kubernetes is by far the most popular system in this category and that brings with it a huge ecosystem of tools, services, patterns and documentation, so for me it was an easy choice.</p>

<p>The difficult thing was figuring out how to run it. Big cloud providers like Azure or CoreWeave provide a managed Kubernetes service. Because we were tied to Voltage Park, managed Kubernetes services weren’t an option. There are many ways to run Kubernetes on your own. I initially picked Kubespray because it was mentioned in the official Kubernetes documentation and it was built on Ansible, which we were already using. While I did successfully run a cluster using Kubespray, I was not satisfied:</p>

<ul>
  <li>Creating or modifying a cluster is very slow: around 30 minutes to apply a configuration change to a 6-node cluster.</li>
  <li>Routine operations would often fail and leave the cluster in an unknown and probably invalid state that was very difficult to recover from. Because Kubespray makes changes to the underlying node OS but doesn’t take full responsibility for it the way Talos does, getting back to a good state required reinstalling the OS and re-running Kubespray, which took over an hour.</li>
  <li>We had mysterious and hard-to-debug issues with the GPUs becoming inaccessible, which we worked around by rebooting the nodes.</li>
  <li>There is no paid support option and I couldn’t resolve the above issues using the free community-provided support (including the patchy documentation).</li>
</ul>

<p>In contrast, Talos:</p>

<ul>
  <li>Takes much less time to apply configuration (sorry I don’t remember the timing, but the fact that I don’t remember means it was not a big deal!).</li>
  <li>Installs the operating system in a pre-configured state such that it is ready to be part of the Kubernetes cluster, and the OS is immutable (read-only after installation), so it is much less likely to end up in weird states.</li>
  <li>Had no such flaky GPU issues.</li>
  <li>Has paid support options and excellent response times to community-reported bugs.</li>
</ul>

<p>The main challenge we had with Talos is that our cloud provider did not give us a way to install a custom OS. After first trying to run it inside a VM inside the Ubuntu host, we ended up finding a way to overwrite Ubuntu with Talos from within Ubuntu! This meant we could run Talos on bare metal.</p>

<h3 id="distribution-registry-container-image-hosting">Distribution Registry: Container image hosting</h3>

<p>Container images are the unit of distribution for code that runs on Kubernetes. A container registry is a service that stores and allows clients to upload and download container images. There are many options for cloud-hosted managed container registries, but we wanted our images to be stored on the same local network as our Kubernetes nodes in order to maximize bandwidth when downloading (AKA “pulling”) images. So we ran <a href="https://distribution.github.io/distribution/about/">Distribution Registry</a> inside our cluster.</p>

<h4 id="the-journey-1">The journey</h4>

<p>The main difficulty with self-hosting the registry was configuring network access. Our requirements:</p>

<ul>
  <li>The registry is accessible via HTTPS to pods inside the cluster. This is needed because Katib (discussed below) <a href="https://github.com/kubeflow/katib/blob/5723604d419c5ba5bf01240b7be5ebf55aaee0bc/pkg/webhook/v1beta1/pod/utils.go#L63">fetches image metadata directly from the registry</a> and there is no easy way to tell it to connect without HTTPS.</li>
  <li>Image pushes and pulls do not go through Tailscale, since that reduces bandwidth and our images are pretty large.</li>
</ul>

<p>We ended up with the following solution:</p>

<ul>
  <li>A Tailscale name is used for the registry. We configured Tailscale to automatically generate an SSL certificate so connections over HTTPS work.</li>
  <li>We configured our cluster’s DNS to forward requests for .ts.net domains to an in-cluster Tailscale DNS. So connections from inside a pod inside the cluster also go through Tailscale.</li>
  <li>We configured our containerd, which is responsible for pulling images when starting containers, to treat the registry’s Tailscale domain name as an alias of the registry’s in-cluster .svc.cluster.local name, thus bypassing Tailscale encryption and maintaining fast image pulls.</li>
  <li>We configured Kaniko (discussed below) to push to the registry through its .svc.cluster.local name, thus bypassing Tailscale and maintaining fast image pushes.</li>
</ul>

<h3 id="kaniko-in-cluster-image-building">Kaniko: In-cluster image building</h3>

<p><a href="https://github.com/GoogleContainerTools/kaniko">Kaniko</a> takes in a git repository URL and revision and a path to a Dockerfile within the repository, and it builds an image according to the Dockerfile and pushes it to our Distribution Registry. This is how a user’s code gets into a container image.</p>

<h4 id="the-journey-2">The journey</h4>

<p>We started by building images locally on the user’s computer and then pushing them from there to the cluster. This worked, but due to some large dependencies (e.g. PyTorch alone is over 900 MB), any push of the image layer that contained the dependencies was very slow. Since the actual code being modified (i.e. the git repo) was much smaller, it made sense to upload only that from the user’s computer, build the image in the cluster, and let the push to the registry happen over a fast local connection. This does require users to commit and push their code to git before starting an experiment, but that is a good practice anyway.</p>

<h3 id="katib-multi-trial-experiment-orchestration">Katib: Multi-trial experiment orchestration</h3>

<p><a href="https://www.kubeflow.org/docs/components/katib/">Katib</a> is a system for running distributed hyperparameter search on Kubernetes. A “search” over different random seeds can be used as a way to get a distribution of results for a fixed set of hyperparameters. Katib is very flexible but that flexibility means it requires a lot of configuration for each experiment. We were able to simplify the user experience dramatically through a mix of automation and convention. The main things that Katib needs to know are:</p>

<ul>
  <li>The search space. We require the user to write this in a YAML file.</li>
  <li>The metrics to optimize. We require the user to write this in the same YAML file.</li>
  <li>How to run an individual trial. E.g., container image to use, how many GPUs it needs. The Launch tool (described below) handles this automatically.</li>
  <li>How to pass in hyperparameter values for a trial. The Launch tool handles this automatically, by assuming the user’s code follows the convention of using Draccus (described below) or something compatible for command line parsing.</li>
  <li>How to extract metrics from a trial. The Launch tool handles this automatically, by assuming the user’s code follows the convention of writing its metrics in Tensorboard format to the path specified in the <code class="language-plaintext highlighter-rouge">--tensorboard_dir</code> command line arg.</li>
</ul>
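<p>As a rough sketch, the per-experiment YAML boils down to something like the following, shown here as the equivalent Python dict with a cheap validity check. The field names are illustrative, not our actual schema:</p>

```python
# Hypothetical shape of the search space + metrics config the user writes
# in YAML. Field names are illustrative, not the actual schema we used.
experiment_config = {
    "search_space": {
        "lr": {"type": "double", "min": 1e-4, "max": 1e-1},
        "batch_size": {"type": "int", "min": 32, "max": 256},
    },
    "metrics": {"objective": "validation_loss", "goal": "minimize"},
}

def validate(config):
    """Cheap checks that catch obvious mistakes before touching the cluster."""
    assert config["metrics"]["goal"] in ("minimize", "maximize")
    for name, spec in config["search_space"].items():
        assert spec["min"] < spec["max"], f"empty range for {name}"

validate(experiment_config)
```

One of the advantages over Ray Tune mentioned below is exactly this: a static config file is much easier to validate up front than arbitrary Python.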

<h4 id="the-journey-3">The journey</h4>

<p>Before Katib, we tried Ray Tune. The things we liked less about Ray Tune than Katib:</p>

<ul>
  <li>The APIs and documentation are a mess. In contrast, while Katib’s documentation is very incomplete, the repo contains lots of examples that are pretty instructive, and the APIs are much more intuitive.</li>
  <li>Ray Tune requires writing an imperative Python file using the aforementioned confusing APIs for every search. It’s much easier to check the validity of a static YAML file that configures Katib than to check for all the ways Python code might be wrong.</li>
  <li>Ray Tune seemed to require more restructuring of the researcher’s code.</li>
  <li>The only way to track progress is via terminal output (whereas Katib has a nice web UI), and even totally correct use of Ray Tune results in massive amounts of warnings and useless messages being printed.</li>
</ul>

<h3 id="draccus-training-code-configuration-specification">Draccus: Training code configuration specification</h3>

<p><a href="https://github.com/dlwh/draccus">Draccus</a> is a simple Python library for defining and parsing configuration using dataclasses. The key thing that our system requires of the training code used for a trial is that a hyperparameter named “foo” is accepted and parsed via the command line flag <code class="language-plaintext highlighter-rouge">--foo</code>. This lets the Launch tool translate mechanically between the search space the user wrote in YAML and the command line for a trial.</p>
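<p>To make the convention concrete, here is a minimal stand-in built on argparse (not Draccus’s actual API): each dataclass field named “foo” becomes a <code class="language-plaintext highlighter-rouge">--foo</code> flag.</p>

```python
import argparse
from dataclasses import dataclass, fields

# Stand-in for the Draccus convention: a config field named "foo" is
# settable via --foo. Draccus's real API differs; this only illustrates
# the contract that the Launch tool relies on.
@dataclass
class TrainConfig:
    lr: float = 1e-3
    batch_size: int = 64
    tensorboard_dir: str = "/tmp/tb"

def parse_config(argv):
    parser = argparse.ArgumentParser()
    for f in fields(TrainConfig):
        # Each dataclass field becomes a command line flag of the same name.
        parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return TrainConfig(**vars(parser.parse_args(argv)))

cfg = parse_config(["--lr", "0.01", "--tensorboard_dir", "/data/tb"])
# cfg.lr == 0.01, cfg.batch_size == 64 (default), cfg.tensorboard_dir == "/data/tb"
```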

<h4 id="the-journey-4">The journey</h4>

<p>The main alternative I considered was Hydra. Hydra seems to have a superset of the functionality in Draccus but the added complexity of all of the options didn’t seem worth the benefits. Due to the modular system design, it would be easy to switch later if the team decides Hydra is needed.</p>

<h3 id="mlflow-experiment-metric-tracking">MLflow: Experiment metric tracking</h3>

<p>While Katib tracks the metrics being optimized in a search, there are many other metrics that can be useful to analyze, and having a UI to visualize metrics throughout a trial and compare them across experiments is really useful. Storing artifacts like videos of an agent interacting with an RL environment is also key for understanding training progress. For this we used <a href="https://mlflow.org/docs/latest/tracking.html">MLflow Tracking</a>, a service to track metrics and store artifacts.</p>

<h4 id="the-journey-5">The journey</h4>

<p>The main alternative I considered was Weights &amp; Biases. While they have very similar sets of features, we ended up choosing MLflow because:</p>

<ul>
  <li>It has documented HTTP APIs, meaning one can interact with it from any language. I didn’t want to be forced to use Python for all the tooling that might want to interact with our experiment metrics.</li>
  <li>It can be much cheaper. Databricks doesn’t make this clear, but if you provide your own storage (e.g. S3 bucket), they will host MLflow tracking for free. Or you can self-host for free.</li>
</ul>

<p>Once we had settled on MLflow, the main challenge was finding and enforcing a convention on how to organize experiments and runs so people can find what they need. While I’m not confident this is the best solution, we ended up writing a small wrapper over the MLflow Python client that sets the experiment name and run name to match the Katib experiment and trial name. It gets this Katib metadata from environment variables set by the Launch tool. This at least makes it easy to go between the two systems.</p>
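<p>The gist of that wrapper, sketched below. The environment variable names here are assumptions (Launch sets the real ones); the returned names would then be passed to the MLflow client, e.g. via <code class="language-plaintext highlighter-rouge">mlflow.set_experiment(...)</code> and <code class="language-plaintext highlighter-rouge">mlflow.start_run(run_name=...)</code>.</p>

```python
import os

# Sketch of the Katib -> MLflow naming convention. The env var names are
# assumptions, not the ones our Launch tool actually set.
def mlflow_names(env=None):
    """Return (experiment_name, run_name) matching the Katib experiment/trial."""
    env = os.environ if env is None else env
    # Fall back to generic names so code still works when run outside Katib.
    experiment = env.get("KATIB_EXPERIMENT_NAME", "adhoc")
    run = env.get("KATIB_TRIAL_NAME", "local-run")
    return experiment, run
```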

<h3 id="tailscale-secure-remote-access">Tailscale: Secure remote access</h3>

<p>In order to create and monitor experiments, users need to have access to Katib and other services running in our cluster. We used <a href="https://tailscale.com">Tailscale</a> for this. It provides network encryption and DNS like a VPN but all the connections are peer-to-peer rather than forcing everything through a single VPN server. It works great and integrated seamlessly with our Google workspace accounts.</p>

<h3 id="launch-user-cli-that-glues-it-all-together">Launch: User CLI that glues it all together</h3>

<p>The only part of this that we wrote ourselves is a tool called Launch. It is <a href="https://github.com/Astera-org/launch">open source</a> and written in Rust. Launch glues everything together. It takes in:</p>

<ul>
  <li>A YAML file specifying the search space and to-be-optimized metrics.</li>
  <li>A --gpus flag, which specifies the number of GPUs per trial.</li>
  <li>A command to run for each trial.</li>
</ul>

<p>And it:</p>

<ul>
  <li>Triggers a build of a container image of the current git branch via Kaniko.</li>
  <li>Constructs a full Katib experiment spec. In addition to the info from the user’s YAML file, it tells Katib to pass hyperparameter values via command line args according to the conventions (described above in the Katib section), and it adds the --tensorboard_dir arg.</li>
  <li>Creates the Katib experiment.</li>
  <li>Prints URLs of Katib and MLflow UI pages for the experiment.</li>
  <li>Polls the cluster to check that the experiment starts and runs a trial successfully.</li>
</ul>
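<p>Launch itself is written in Rust, but the mechanical translation it performs from a hyperparameter assignment to a trial command line can be sketched in a few lines of Python (names here are illustrative):</p>

```python
# Illustrative sketch: turn one hyperparameter assignment into the command
# line for a trial, relying on the --foo convention from the Draccus section.
def trial_command(base_cmd, hyperparams, tensorboard_dir):
    args = list(base_cmd)
    for name, value in hyperparams.items():
        args += [f"--{name}", str(value)]
    # Launch adds this arg so Katib can read the trial's metrics.
    args += ["--tensorboard_dir", tensorboard_dir]
    return args

cmd = trial_command(["python", "train.py"], {"lr": 0.01, "batch_size": 64}, "/metrics/tb")
# → ['python', 'train.py', '--lr', '0.01', '--batch_size', '64',
#    '--tensorboard_dir', '/metrics/tb']
```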

<h4 id="the-journey-6">The journey</h4>

<p>The main question I struggled with was what language to use to implement the tool. Due to the available libraries for interacting with Kubernetes and Katib, Go was the obvious choice. The factors in favor of Rust were pre-existing expertise on the team and me thinking it would be more fun in Rust. The availability of libraries was almost decisive in Go’s favor until we discovered we could use the OpenAPI Generator to generate Rust client libraries for <a href="https://github.com/Astera-org/kubernetes-client-rust">Kubernetes</a> and <a href="https://github.com/Astera-org/katib-client-rust">Katib</a>. Compared to Go, some of the nicest things about Rust are the power of the Serde library for deserializing configuration files and the error handling syntax (writing <code class="language-plaintext highlighter-rouge">foo()?</code> is so much nicer than <code class="language-plaintext highlighter-rouge">if err := foo(); err != nil { return err }</code> ).</p>

<h2 id="system-diagram">System diagram</h2>

<p>Notes:</p>

<ul>
  <li>In reality all of this could be running on a single server, or distributed as shown, or something in between. Kubernetes handles the scheduling dynamically.</li>
  <li>Not shown, but in addition to the depicted MLflow upload, user code also writes metrics to a local directory in Tensorboard format, which is what Katib monitors.</li>
</ul>

<p><img class="wrap" src="/generated/2025-01-27-obelisk-infra-diagram-800-51182cbce.png" alt="System diagram of the experiment infrastructure" srcset="/generated/2025-01-27-obelisk-infra-diagram-400-4868e216e.webp 400w, /generated/2025-01-27-obelisk-infra-diagram-600-4868e216e.webp 600w, /generated/2025-01-27-obelisk-infra-diagram-800-4868e216e.webp 800w, /generated/2025-01-27-obelisk-infra-diagram-1000-4868e216e.webp 1000w" /></p>

<h2 id="what-could-be-improved">What could be improved</h2>

<p>The biggest thing that I wish I could have improved before I left was the latency of building and pushing container images. While Kaniko is supposed to support caching, we weren’t able to get it working, so every time a user launched an experiment Kaniko would take a few minutes to rebuild the entire image (unless they didn’t change any code at all, in which case we would re-use a previously built image). The solution I wanted to try was to build and push the images using Bazel, which has many options for caching and would also allow us to have very fine-grained control over the image to optimize it for build speed. In particular, Bazel should make it possible to have one image layer per Python package in our dependencies, so if a single dependency changes we wouldn’t need to rebuild and push a single huge layer that has all of our dependencies.</p>

<p>Another thing I wanted to do was to modify the Katib UI to allow adding a link from Katib to MLflow. This is hopefully a simple change.</p>

<p>Finally there are things which we didn’t implement only because we didn’t need them, but which I expected to need at some point. These include queuing experiment trials according to a priority (which I planned to implement via Kueue) and multi-machine trials (which I planned to implement via Kubeflow Training Operator).</p>

<h2 id="credits">Credits</h2>

<p>Matthew Behrens and Mick van Gelderen helped a lot with many aspects. Among other things, Matt actually got Talos running, including figuring out how to install it from inside Ubuntu and finding versions of Talos and the Nvidia system extensions that worked with our hardware, and Mick implemented most of the Launch tool, discovered Kaniko and proved it could work.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>This work was done in 2024. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>garymm</name></author><category term="machine learning" /><category term="programming" /><summary type="html"><![CDATA[While working on machine learning research at the Astera Institute1, I led a team that assembled a system that enabled researchers to quickly and easily run experiments that used up to a full datacenter’s worth of GPUs. I intentionally wrote “assemble” rather than “build”, because the system mostly consists of off-the-shelf components. The challenge was in digging through the huge number of options for each possible piece of functionality, selecting appropriately, gluing things together into a working system, and designing an easy but powerful interface. I’m proud of how little code we wrote relative to how much functionality the system provides. This work was done in 2024. &#8617;]]></summary></entry><entry><title type="html">JAX and Equinox: What are they and why should I bother?</title><link href="http://garymm.org/blog/2024/09/08/jaxwhat/" rel="alternate" type="text/html" title="JAX and Equinox: What are they and why should I bother?" /><published>2024-09-08T00:00:00-07:00</published><updated>2024-09-08T00:00:00-07:00</updated><id>http://garymm.org/blog/2024/09/08/jax-equinox-what-and-why</id><content type="html" xml:base="http://garymm.org/blog/2024/09/08/jaxwhat/"><![CDATA[<p>This post is written as a Jupyter notebook which you can run and edit using the link below:</p>

<p><a href="https://githubtocolab.com/garymm-org/garymm-org.github.io/blob/master/assets/jax-equinox-what-and-why.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></p>

<div class="jupyter-notebook" style="position: relative; width: 100%; margin: 0 auto;">
  <div class="jupyter-notebook-iframe-container">
    <iframe src="/assets/jax-equinox-what-and-why.ipynb.html" style="position: absolute; top: 0; left: 0; border-style: none;" width="100%" height="100%" onload="this.parentElement.style.paddingBottom = (this.contentWindow.document.documentElement.scrollHeight + 10) + 'px'"></iframe>
  </div>
</div>]]></content><author><name>garymm</name></author><category term="machine learning" /><category term="programming" /><summary type="html"><![CDATA[This post is written as a Jupyter notebook which you can run and edit using the link below:]]></summary></entry><entry><title type="html">Using Fidelity as a checking account to 10x your yield</title><link href="http://garymm.org/blog/2024/08/31/fidelity/" rel="alternate" type="text/html" title="Using Fidelity as a checking account to 10x your yield" /><published>2024-08-31T00:00:00-07:00</published><updated>2024-08-31T00:00:00-07:00</updated><id>http://garymm.org/blog/2024/08/31/fidelity</id><content type="html" xml:base="http://garymm.org/blog/2024/08/31/fidelity/"><![CDATA[<p>You can use an account at Fidelity as a checking account, meaning you can write and deposit checks and withdraw cash from ATMs.
Why do this?</p>

<h2 id="higher-yield">Higher yield</h2>

<p>At Fidelity you can get much higher yield on your money without sacrificing liquidity. E.g., the Schwab checking account I used before switching to Fidelity currently pays 0.45%. The money in the equivalent account at Fidelity had an annualized yield of 4.96% last week.</p>

<p>To get that higher yield, Fidelity will invest your money in US treasury securities. The actual rate varies depending on the market. There are probably times when the yield will be lower than what you can get in a checking account. For example, in 2021, Fidelity’s <a href="https://fundresearch.fidelity.com/mutual-funds/performance-and-risk/31617H102">Government Money Market Fund earned 0.01%</a> whereas the average <a href="https://ycharts.com/indicators/us_interest_checking_account_rate">checking account was paying 0.03%</a>. At that low end the absolute difference is negligible, but when interest rates rise, the difference is huge (e.g. the current 4.96% vs 0.45%).</p>

<p>Some consider treasuries riskier than an FDIC-insured checking account. Personally I think the odds of losing money in both are very similar. They both involve the US government defaulting on its obligations (treasury debts in one case, FDIC insurance in the other). Fidelity does offer an FDIC-insured investment that is currently paying 2.72%, so ~5x what a checking account pays with the same risk.</p>

<h2 id="account-consolidation">Account consolidation</h2>

<p>You can choose to consolidate several accounts (checking, retirement investments, non-retirement investments, etc) at Fidelity and thus have one fewer financial institution to deal with. I personally have spending money (what I used to have in a checking account), non-retirement investments, a health savings account, and a retirement account there.</p>

<h2 id="how-to-set-it-up">How to set it up</h2>

<p>There are two ways: use a brokerage account or a cash management account.
Here’s how they compare along the main axes I care about:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>brokerage account</th>
      <th>cash management account</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>ATM fees reimbursed</td>
      <td>only if you have &gt;$250k across all Fidelity accounts</td>
      <td>yes</td>
    </tr>
    <tr>
      <td>FDIC insured option</td>
      <td>no</td>
      <td>yes (lower yield)</td>
    </tr>
    <tr>
      <td>use 1 account for checking + investments</td>
      <td>yes</td>
      <td>no</td>
    </tr>
  </tbody>
</table>

<p>When you open your account you’ll need to select what your “cash” holding is actually invested in. Currently the default for both is the Fidelity Government Money Market Fund (SPAXX). Then when you deposit money into your account, it is automatically used to purchase shares of that holding.</p>

<p>The routing number that Fidelity gives you for making direct deposits or withdrawals is actually associated with another institution (since Fidelity is not actually a bank that offers checking accounts, it has to partner with another institution to access the ACH system for transfers), so don’t be alarmed if you enter the routing number and some website says it’s not Fidelity.</p>

<p>There’s lots more details over at the <a href="https://www.bogleheads.org/wiki/Fidelity:_one_stop_shop#Suggested_account_usages">Bogleheads wiki</a>. The one inaccuracy I noticed there is it says that brokerage account ATM fees are reimbursed only for “Private Client Group”, but it’s actually both Private and Premium clients and Premium is a lower threshold (&gt;$250k assets across all your Fidelity accounts, as of writing).</p>]]></content><author><name>garymm</name></author><category term="money" /><summary type="html"><![CDATA[You can use an account at Fidelity as a checking account, meaning you can write and deposit checks and withdraw cash from ATMs. Why do this?]]></summary></entry><entry><title type="html">Kagi vs Google search: a personal evaluation</title><link href="http://garymm.org/blog/2024/08/17/kagigoogle/" rel="alternate" type="text/html" title="Kagi vs Google search: a personal evaluation" /><published>2024-08-17T00:00:00-07:00</published><updated>2024-08-17T00:00:00-07:00</updated><id>http://garymm.org/blog/2024/08/17/kagi-vs-google-search</id><content type="html" xml:base="http://garymm.org/blog/2024/08/17/kagigoogle/"><![CDATA[<p><a href="https://kagi.com">Kagi</a> is a relatively new search engine. Unlike Google, it makes money through user subscriptions and shows no ads.
Despite having decreased my usage of web search since the release of ChatGPT, I still use it a lot, and would be willing to pay a few
bucks a month for a significantly better experience. To evaluate Kagi, I put 75 of my recent search queries into Kagi and Google and rated
which I preferred. The queries spanned various topics, heavily tilted towards software engineering and computer topics.</p>

<h2 id="summary">Summary</h2>

<p>After this experiment I’ve decided to pay for Kagi and set it as my default search engine on both my phone and laptop.</p>

<p>Here’s a qualitative comparison and some thoughts:</p>

<ul>
  <li>When Google shows ads on my phone, it really hurts the experience since it takes up the whole screen (often two screens of scrolling) and the ads are very very rarely relevant (with the exception of Google shopping results, which are sometimes relevant). On desktop the ads are a minor annoyance since I typically can still see the non-ad results without scrolling, and in general when I’m using my laptop I’m in less of a hurry. However maybe only 1/10 of my queries trigger non-Google-shopping ads. Probably because many of my queries are very specific and technical. As noted, Kagi doesn’t show any ads ever.</li>
  <li>Google is better at extracting relevant information (either from web results or structured data like stock prices) and putting it at the top of the search results. For Kagi this information is usually in the pages that are at or near the top, but it takes an extra click to get it. E.g. a graph of a stock’s price.</li>
  <li>Kagi shows more results from somewhat obscure, non-commercial sites and blogs. For some of my queries, these sites had excellent content that I would be very unlikely to find via Google.</li>
  <li>Google has a lot of features that I don’t care about that you might (for example, live sports scores).</li>
  <li>I didn’t thoroughly evaluate queries where I was trying to buy products online. Kagi doesn’t have a shopping search feature, and I expect I will probably continue to use Google shopping in addition to other sites to shop.</li>
</ul>

<p>Having worked at Google on search and seen how much human ingenuity and money went into building it, it’s pretty shocking
that a <a href="https://blog.kagi.com/what-is-next-for-kagi">37 person</a> (as of 2024-04) company can compete at all, but here we are!</p>

<h2 id="detailed-results">Detailed results</h2>

<ul>
  <li>Tie: 47 / 75</li>
  <li>Strongly prefer Google: 3 / 75</li>
  <li>Strongly prefer Kagi: 4 / 75</li>
  <li>Weakly prefer Google: 11 / 75</li>
  <li>Weakly prefer Kagi: 10 / 75</li>
</ul>

<h3 id="google-big-wins">Google big wins</h3>

<ul>
  <li>“bryant controlbox google home”. <a href="https://www.reddit.com/r/smarthome/comments/j32rkz/bryant_evolution_connex_connect_talking_to_other/">This reddit post</a> is the only satisfying result on either, and it’s in the first few results for Google but not for Kagi.</li>
  <li>“piedmont california front setback requirements”. Google has an “AI Overview” with the answer (which appears to have been extracted from a PDF). Kagi’s top result doesn’t have the answer on the page, though it does link to the PDF that contains the answer. It would take at least a minute of careful reading of the page that Kagi returned to figure out which link to click to get the right PDF, and then loading and searching in the PDF might take another minute.</li>
  <li>“intc stock”. Google has a nice interactive graph. Kagi has some data (like current price, 52 week range), but I like the interactive graph more.</li>
</ul>

<h3 id="kagi-big-wins">Kagi big wins</h3>

<ul>
  <li>“lugg movers”. Google starts with several ads for other companies (competitors to Lugg I assume). On my phone, I needed to scroll down two full screens to get past the ads to the actual result I wanted. Kagi had no ads, and had the official Lugg page (which is what I wanted) at the top.</li>
  <li>“josefk simt”. Kagi returned exactly what I wanted, which was <a href="https://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html">this page from “yosefk.com”</a> even though I misspelled the domain name in the query. Google seems to have decided there were not enough relevant results for my whole query so it searched for just “josefk” and then noted next to each of the results that the result is “<code class="language-plaintext highlighter-rouge">Missing: simt</code>”.</li>
  <li>“how to value startup options”. On desktop, Google starts with 4 ads which are totally irrelevant to what I wanted (though on my phone it didn’t show any ads). After that it had an AI summary which seemed reasonable and itself linked to some pretty good results. After that it had some decent web results. Kagi had no ads and had some of the same results as Google, but only Kagi had <a href="https://www.benkuhn.net/optopt/">this gem</a> from Ben Kuhn near the top. Reading that led to lots of other relevant links on that same site.</li>
  <li>“union find algorithm”: Google’s top result is GeeksforGeeks, which has relevant info but is not presented particularly well and the page has a huge amount of annoying animated ads. Google’s second result is to a pretty useful Wikipedia article. Kagi links to Wikipedia first, and second to <a href="https://labuladong.gitbook.io/algo-en/iv.-high-frequency-interview-problem/union-find-explanation">this page</a> which has no ads and has nice illustrations of the algorithm.</li>
</ul>]]></content><author><name>garymm</name></author><category term="internet" /><category term="information" /><category term="computers" /><category term="search" /><summary type="html"><![CDATA[Kagi is a relatively new search engine. Unlike Google, it makes money through user subscriptions and shows no ads. Despite having decreased my usage of web search since the release of ChatGPT, I still use it a lot, and would be willing to pay a few bucks a month for a significantly better experience. To evaluate Kagi, I put 75 of my recent search queries into Kagi and Google and rated which I preferred. The queries spanned various topics, heavily tilted towards software engineering and computer topics.]]></summary></entry><entry><title type="html">Attention, Memory, and Productive Knowledge Work</title><link href="http://garymm.org/blog/2024/06/09/attention-memory-productive-knowledge-work/" rel="alternate" type="text/html" title="Attention, Memory, and Productive Knowledge Work" /><published>2024-06-09T00:00:00-07:00</published><updated>2024-06-09T00:00:00-07:00</updated><id>http://garymm.org/blog/2024/06/09/attention-memory-and-productive-knowledge-work</id><content type="html" xml:base="http://garymm.org/blog/2024/06/09/attention-memory-productive-knowledge-work/"><![CDATA[<p>Here I present some ideas for increasing the productivity of knowledge workers by structuring their workflows around attention and memory.
I wrote this for my own benefit, but I hope you find it useful too!</p>

<h2 id="workflow-matters">Workflow matters</h2>

<p>By “workflow” I mean loosely how execution tasks are scheduled and coordinated. By “execution tasks” I mean the activities which more-or-less-directly create value. For a software engineer, these tasks include programming and designing.</p>

<p>Much of the most influential thinking about optimizing workflows to increase productivity comes from the automobile industry. The history of car manufacturing has several inspiring examples. In 1909, a Ford Model T Runabout sold for $27,977 (in 2024 USD). In 1925 (16 years later), it sold for $4,517 (also in 2024 USD)<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>. The number of Model Ts you could buy per dollar increased by over 6x!</p>

<p>Much of this increasing productivity was due to changes in the workflow. One major change was the introduction of the moving assembly line. Prior to the assembly line, cars were built through “the craft method”, in which teams of fifteen workers worked simultaneously on a single car<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>, which makes me think of young children playing soccer. This was inefficient in many ways. People got in each other’s way, and they had to spend time walking around the factory to go between cars. With a moving assembly line, parts came to the workers and each worker could complete their stage of production without having to walk, coordinate with others, move tools, etc.</p>

<h2 id="attention-and-memory-matter">Attention and memory matter</h2>

<p>In manufacturing, the main inputs were materials, equipment, and manual labor. In knowledge work, the main input is human minds. To increase productivity, we need to produce more output without increasing inputs. One way to do this is to optimize the workflow, and one way to optimize the knowledge-work workflow is to understand some properties of attention and memory.</p>

<p>“Working memory is a cognitive system with a limited capacity that can hold information temporarily.”<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup> It is essential for reasoning and decision making, which are crucial in knowledge work. The set of mental objects you can mentally manipulate at one time is limited by the capacity of your working memory. After switching tasks, it takes time to build up working memory.<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">4</a></sup></p>

<p>Long-term memory is the system that lets you restore previous working memories. “Forgetting” means something being lost from long-term memory. The “forgetting curve”<sup id="fnref:5"><a href="#fn:5" class="footnote" rel="footnote" role="doc-noteref">5</a></sup> is a stylized fact. The longer you go without retrieving a memory, the more likely you are to forget it.</p>

<p><img class="wrap" src="/generated/2024-06-09-attention-memory-productive-knowledge-work-forgetting-curve-659-d8ed2f894.png" alt="the forgetting curve" srcset="/generated/2024-06-09-attention-memory-productive-knowledge-work-forgetting-curve-400-9b57bcb5c.webp 400w, /generated/2024-06-09-attention-memory-productive-knowledge-work-forgetting-curve-600-9b57bcb5c.webp 600w, /generated/2024-06-09-attention-memory-productive-knowledge-work-forgetting-curve-659-9b57bcb5c.webp 659w" /></p>
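The curve above is often modeled as exponential decay of retention with time since the last retrieval. A minimal sketch, assuming a simple exponential form; the `stability` parameter is illustrative, not a value fitted from any study:

```python
import math

def retention(days_since_retrieval, stability=2.0):
    """Stylized forgetting curve: the chance of recalling something
    decays roughly exponentially with time since the last retrieval.
    `stability` (in days) is an illustrative parameter, not a fitted
    value; larger means slower forgetting."""
    return math.exp(-days_since_retrieval / stability)

print(round(retention(1), 2))  # 0.61: most survives a day
print(round(retention(7), 2))  # 0.03: little survives a week untouched
```

The qualitative point is what matters: each retrieval resets the clock, which is why long gaps between work sessions are so costly.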

<p>This conceptualization of human memory is quite similar to how computers work: working memory is analogous to a computer’s volatile memory (e.g., registers), long-term memory is analogous to persistent storage (e.g., a flash drive), and forgetting is analogous to deleting a file.<sup id="fnref:6"><a href="#fn:6" class="footnote" rel="footnote" role="doc-noteref">6</a></sup></p>

<p>If it’s not obvious by now, workflow interacts with how our memories work. Every time one switches tasks, one must repopulate working memory before becoming productive. And extended periods between sessions of work on a task lead to forgetting. One must re-learn, which takes time.</p>

<h2 id="ways-our-workflow-makes-us-less-productive">Ways our workflow makes us less productive</h2>

<p>Yet many knowledge workers switch tasks very often. This has obviously been the case in many places I’ve worked, but there is some objective data to support this impression: A report from the summer of 2018 analyzed data from over fifty thousand active users of the RescueTime time tracking software. It found that the median time between checking communication apps like email and Slack was 6 minutes, and more than 2/3 of the users <em>never</em> experienced an hour of uninterrupted time.<sup id="fnref:7"><a href="#fn:7" class="footnote" rel="footnote" role="doc-noteref">7</a></sup></p>

<p>Besides this short-term switching between execution and collaboration, people often switch between tasks that they are executing. On several projects I worked on it was common to have work items that went unfinished for months, with sporadic bouts of work spaced weeks apart. Work done this way has many costs, but it definitely incurs costs of forgetting and re-learning.</p>

<h2 id="suggestions">Suggestions</h2>

<p>At a high level:</p>

<ul>
  <li>Minimize context switches so as to avoid cost of loading things into working memory.</li>
  <li>Minimize time between sessions of work on a single task so as to avoid forgetting.</li>
  <li>Remember that convenience ≠ productivity.</li>
</ul>

<p>And now some specific ways to put these principles into practice.</p>

<h3 id="use-meetings-well">Use meetings well</h3>

<p>Instant messages, email, and interactions on doc comments are all asynchronous. Each message involves a context switch. Meetings are synchronous, rapid, concentrated communication. My rule of thumb: after the third message in an email or IM conversation, it’s better to switch to a meeting. An illustration of the cost of context switches that can be avoided by a meeting:</p>

<p><img class="wrap" src="/generated/2024-06-09-attention-memory-productive-knowledge-work-side-by-side-653-0d3f63e19.png" alt="two alternative ways to schedule work" srcset="/generated/2024-06-09-attention-memory-productive-knowledge-work-side-by-side-400-e692b153f.webp 400w, /generated/2024-06-09-attention-memory-productive-knowledge-work-side-by-side-600-e692b153f.webp 600w, /generated/2024-06-09-attention-memory-productive-knowledge-work-side-by-side-653-e692b153f.webp 653w" /></p>

<p>Meetings certainly have their own costs, and are often run poorly, but producing fewer context switches is a huge and underappreciated advantage of meetings over asynchronous communication.</p>

<h4 id="meeting-tips">Meeting tips</h4>

<p>Regularly scheduled meetings are useful for regular, non-urgent communication. Participants know they’ll be able to discuss things relatively soon, and therefore can avoid resorting to asynchronous communication. Between meetings, participants can collect agenda items in a document as they arise. This is an example of convenience ≠ productivity: if I think of something to ask my coworker, it’s more convenient for me to IM him, but if it’s not urgent, it’s more productive for me to add it to the agenda of our next regularly scheduled meeting.</p>

<p>For group meetings, it can be efficient to have a structured way for participants to schedule smaller-group follow-up meetings. When I was a manager at Microsoft, my team’s regular sync meetings were 60 minutes, but the whole team was only expected to meet for at most 30 minutes, and the rest of the hour was used for smaller group follow-up meetings. This avoids a context switch between blocks of meetings and blocks of solo work. And it avoids asynchronous back-and-forth to schedule a follow-up meeting.</p>

<p>Finally, there are many ways meetings can be inefficient, but if participants are vigilant and vocal, they can be improved (or cancelled! Not all meetings are worthwhile).</p>

<h3 id="schedule-asynchronous-communication">Schedule asynchronous communication</h3>

<p>By default, don’t leave your inbox open, don’t leave your IM app open, and don’t leave your phone notifications on. Check these things on a schedule that balances responsiveness to others with your own ability to focus. Personally I follow a loose schedule of checking things first thing in the morning, immediately before meetings, and once or twice during the afternoon, when I happen to feel blocked or need a mental break.</p>

<p>I used to have a problem with getting distracted by my inbox every time I sent an email. To send email without checking your inbox, you can use <a href="https://mail.google.com/mail/?fs=1&amp;tf=cm">this link for GMail</a> or <a href="https://outlook.office365.com/mail/0/deeplink/compose">this one for Outlook</a>.</p>

<h3 id="schedule-focused-work">Schedule focused work</h3>

<p>One technique for avoiding self-imposed distraction is called “<a href="https://en.wikipedia.org/wiki/Pomodoro_Technique">Pomodoro</a>”, which basically consists of setting a timer, and taking a break when the timer goes off.</p>

<p>To increase the odds of having large blocks of time to focus, schedule events on your calendar that prevent others from scheduling meetings. If you have the option to work in a place that is quiet and physically isolated, try to do that during your scheduled focus blocks.</p>

<h3 id="limit-the-number-of-in-progress-tasks">Limit the number of in-progress tasks</h3>

<p>Limiting the number of tasks that you have in-progress can help reduce the temptation to context switch (and flush your working memory) and it will reduce the odds that you forget important details about one incomplete task while you’re working on another. This is a key feature of <a href="https://en.wikipedia.org/wiki/Kanban_(development)">Kanban</a> and <a href="https://en.wikipedia.org/wiki/Scrum_(software_development)">Scrum</a>.</p>

<h3 id="use-tools-to-disseminate-commonly-needed-information">Use tools to disseminate commonly needed information</h3>

<p>Some information is so commonly needed that the questions should be anticipated and built into tools that are used as part of the regular workflow. For example “Who is working on this task?” Proper use of an issue tracker (e.g., GitHub Issues, Asana) can answer this without the back-and-forth of asynchronous communication or the time burden of a meeting. If you find there is some question like this that is repeatedly asked, but has fairly formulaic answers, check if there’s a tool that you can adopt that will disseminate that information more efficiently.</p>

<h3 id="speed-up-testing-and-reviews">Speed up testing and reviews</h3>

<p>This is somewhat specific to software development, but it probably has analogs in other professions.</p>

<p>The “testing and reviews” part of the software workflow typically looks like:</p>

<pre>While not approved:
    Author: rebase, (read / think, write, build, run) until ready.
    Wait for automatic checks.
    Reviewer: read / think, comment. Maybe approve.</pre>

<p>This leaves plenty of room for context switching and forgetting:</p>

<pre>While not approved:
    Wait for author to start.  <b>Author and reviewer forget.</b>
    Author: rebase, (read / think, write, build, run) until ready.  <b>Author context switch. Reviewer forgets.</b>
    Wait for max(automatic checks, reviewer to start).  <b>Author and reviewer context switch + forget.</b>
    Reviewer: read / think, comment. Maybe approve.  <b>Reviewer context switch. Author forgets.</b></pre>

<p>What can we do about this? If automatic checks are frequently the bottleneck, spend time speeding them up. If code reviews are the bottleneck, speed those up. On a previous team I set up a duty rotation to review any changes that did not yet have a reviewer. You might also experiment with pair programming, which basically combines code review and programming.</p>

<h2 id="acknowledgements-and-further-reading">Acknowledgements and further reading</h2>

<p>Besides my own experience, this post is based on the following:</p>

<ul>
  <li><em>A World Without Email</em> by Cal Newport.</li>
  <li><em>Deep Work</em> by Cal Newport.</li>
  <li><em>Getting Things Done</em> by David Allen.</li>
  <li><em>More Effective Agile</em> by Steve McConnell.</li>
</ul>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>https://en.wikipedia.org/wiki/Ford_Model_T#Price_and_production <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2">
      <p>A world without email, page 97. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3">
      <p><a href="https://en.wikipedia.org/wiki/Working_memory">https://en.wikipedia.org/wiki/Working_memory</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4">
      <p>https://en.wikipedia.org/wiki/Psychological_refractory_period <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5">
      <p>https://en.wikipedia.org/wiki/Forgetting_curve <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6">
      <p>https://en.wikipedia.org/wiki/Memory_hierarchy <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:7">
      <p>A world without email, page 11 <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>garymm</name></author><category term="work" /><category term="productivity" /><summary type="html"><![CDATA[Here I present some ideas for increasing the productivity of knowledge workers by structuring their workflows around attention and memory. I wrote this for my own benefit, but I hope you find it useful too!]]></summary></entry><entry><title type="html">Why I considered IVF despite not having any fertility issues, and then decided against it</title><link href="http://garymm.org/blog/2024/03/10/why-i-considered-ivf/" rel="alternate" type="text/html" title="Why I considered IVF despite not having any fertility issues, and then decided against it" /><published>2024-03-10T00:00:00-08:00</published><updated>2024-03-10T00:00:00-08:00</updated><id>http://garymm.org/blog/2024/03/10/why-i-considered-ivf</id><content type="html" xml:base="http://garymm.org/blog/2024/03/10/why-i-considered-ivf/"><![CDATA[<p>In <a href="https://www.garymm.org/blog/2023/11/10/the-dangers-of-reproducing-while-old/">The dangers of reproducing while old</a>, I mentioned that pre-implantation genetic testing could be a way for older prospective parents to improve outcomes. This led my partner and me to seriously consider doing IVF despite not having any fertility issues. At a high level, my conclusion after writing that post was:</p>

<ul>
  <li>Older parents are at much higher risk of passing harmful genetic mutations to their embryos.</li>
  <li>IVF gives people the opportunity to screen for these harmful mutations, potentially avoiding miscarriage or serious health conditions later on. It also provides an opportunity to use polygenic screening to select for desirable traits.</li>
  <li>There’s not much evidence that IVF results in worse health outcomes for children.</li>
</ul>

<p>After investigating this more, I now think:</p>

<ul>
  <li>The benefit of polygenic screening is currently generally small, and in our case it would be tiny.</li>
  <li>There’s some evidence of IVF producing worse health outcomes.</li>
</ul>

<h2 id="polygenic-screening">Polygenic screening</h2>

<p>I think polygenic screening<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> has a lot of potential, but currently it has serious limitations, including:</p>

<ol>
  <li>The models for certain traits like intelligence just aren’t very good at predicting. This is called the “missing heritability” problem, and it’s quite controversial exactly what’s going on, but some of the issues are clear. One is that current models are based on data that measured other things, like educational attainment, which are correlated with intelligence, but not perfectly (<a href="https://www.sciencedirect.com/science/article/abs/pii/S0160289606000171">e.g., Deary et al. find 0.81</a>). Another is that some of the variance in traits we care about is probably caused by very rare variants, which would require huge sample sizes to detect.</li>
  <li>The models are much better for people of certain ancestry than others, because of the data that they were created with. My impression is that the models currently work best for people of northern European ancestry.</li>
  <li>The expected variance amongst embryos from the same parents is pretty low. To have good odds of finding an embryo that has polygenic scores much better than the average of its genetic siblings (i.e. the other embryos parents will be choosing amongst<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>), one needs a lot of embryos to choose from.</li>
</ol>

<p>Point 1 should decrease the appeal of polygenic screening for everyone. Point 2 decreased it for me because my children would be of mostly non-northern-European ancestry. As for the number of embryos, my partner took a test that suggested a round of egg retrieval would yield relatively few eggs. This meant she’d probably have to undergo multiple rounds of egg retrieval, which is expensive and unpleasant.</p>

<p>It’s hard to find good information on this topic. It’s not hard to find people saying that the technology doesn’t work, but I get the distinct whiff of motivated reasoning from articles like <a href="https://liorpachter.wordpress.com/2021/04/12/the-amoral-nonsense-of-orchids-embryo-selection/">this one by Lior Pachter</a>. Basically, I think the author finds polygenic screening morally wrong or disgusting, and is therefore finding reasons to say it won’t work.</p>

<p>I want polygenic screening to work. If we were doing IVF anyway, I would definitely have the embryos polygenically screened.</p>

<h2 id="ivf-health-outcomes">IVF health outcomes</h2>

<p>I found the excellent paper <a href="https://academic.oup.com/humupd/article/25/2/137/5316072?login=false">The health of children conceived by ART: ‘the chicken or the egg?’</a>, which looks at studies that try to control for the systematic differences between people who pursue IVF and those who don’t. That review, and newer studies such as <a href="https://pubmed.ncbi.nlm.nih.gov/35934120/">Sutcliffe (2023)</a>, did not find large differences in IVF babies later in life. But the best-controlled studies <em>do</em> find a pretty large increased risk of preterm birth (relative risk somewhere in the range of 1.5-2). <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5791955/">Elsewhere</a> I found the absolute risk of preterm birth for a 36-year-old mother is about 6%, so IVF might take that up to 9-12%. Since preterm birth is associated with a lot of bad health conditions, it would be surprising if IVF children are more likely to be born preterm but are equally healthy later on.</p>
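The back-of-the-envelope above (a ~6% baseline scaled by a relative risk of 1.5-2) can be sketched directly; `scaled_risk` is my own illustrative helper, and the approximation ignores confidence intervals:

```python
def scaled_risk(baseline, relative_risk):
    """Back-of-the-envelope: absolute risk is roughly the baseline
    risk multiplied by the relative risk."""
    return baseline * relative_risk

baseline_preterm = 0.06  # ~6% preterm-birth risk for a 36-year-old mother
low = scaled_risk(baseline_preterm, 1.5)
high = scaled_risk(baseline_preterm, 2.0)
print(f"{low:.0%} - {high:.0%}")  # 9% - 12%
```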

<p>I wonder how to reconcile the elevated preterm birth risk with the generally good outcomes in adults. Some possibilities:</p>

<ul>
  <li>Preterm birth rates are not higher. The studies are just not controlling for something important.</li>
  <li>Preterm birth caused by IVF is not associated with later bad health, but preterm birth caused by other things is.</li>
  <li>There are negative impacts, but the studies of adults have missed them. Some possible reasons why: studies are too small, the negative health outcomes haven’t shown up yet because most IVF babies are too young, studies looked at the wrong metrics, there’s some selection bias such that the least healthy people are less likely to be studied (this one seems quite plausible to me).</li>
</ul>

<p>My current guess is there’s a small but real tendency for people conceived via IVF to be less healthy later in life, though I’m extremely uncertain about the exact aspects of health, the magnitude and the frequency involved.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">
      <p>If you are not familiar with polygenic screening, I recommend <a href="https://www.lesswrong.com/posts/yT22RcWrxZcXyGjsA/how-to-have-polygenically-screened-children">Gene Smith’s post</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2">
      <p>Unless you’re using donor gametes. In which case you might use multiple donors and compare across non-siblings. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>garymm</name></author><category term="parenting" /><category term="biology" /><category term="health" /><summary type="html"><![CDATA[In The dangers of reproducing while old, I mentioned that pre-implantation genetic testing could be a way for older prospective parents to improve outcomes. This led my partner and I to seriously consider doing IVF despite not having any fertility issues. At a high level, my conclusion after writing that post was:]]></summary></entry><entry><title type="html">The dangers of reproducing while old</title><link href="http://garymm.org/blog/2023/11/10/the-dangers-of-reproducing-while-old/" rel="alternate" type="text/html" title="The dangers of reproducing while old" /><published>2023-11-10T00:00:00-08:00</published><updated>2023-11-10T00:00:00-08:00</updated><id>http://garymm.org/blog/2023/11/10/the-dangers-of-reproducing-while-old</id><content type="html" xml:base="http://garymm.org/blog/2023/11/10/the-dangers-of-reproducing-while-old/"><![CDATA[<p>I had my first child when I was 36 years old, which made me want to understand the risks of having children at different ages. Before looking into this, my impression was that the main biological problems with old-age parenthood had to do with not having the necessary health and vigor to care for young’uns, and I had heard that older women have trouble getting pregnant. While those are real issues, there are many others worthy of consideration.</p>

<p>My read of the evidence is that the risks of miscarriage and serious health problems for children, including autism and birth defects, increase significantly with parental (both paternal and maternal) age. The data I could find for most risks is not very fine-grained and not very precise, but I think this qualitative description matches the data: risks start rising at around 30 years old for both mothers and fathers, rise gradually through about 35 for mothers and 40 for fathers, and then rise sharply after that.</p>

<p>Interestingly, the ages at which things start to go wrong are similar for fathers and mothers, but the mechanisms are different. Sperm cells are produced throughout a man’s life, and each time a new cell is produced, there is a chance of a genetic mutation. Sperm are produced by copying the DNA of other short-lived cells, which are themselves produced in the same way, so mutations accumulate. Women’s egg cells, however, are all present when a woman is born, but over time they accumulate damage.</p>

<p>If this is correct, then there are two ways to reduce these risks: have kids when young, or use frozen gametes from your younger selves. If you’re already in the danger zone and don’t have frozen gametes, pre-implantation genetic testing may be able to screen out embryos that have certain genetic defects, and thus reduce the risk of some bad outcomes.</p>

<p>My advice:</p>

<ul>
  <li>If you want to have kids at some later age, and that later age is &gt;= 35 for a woman or &gt;= 40 for a man, freeze your gametes ASAP.</li>
  <li>If you’re already past those age thresholds and you have the means, consider in-vitro fertilization so you can take advantage of pre-implantation genetic testing.</li>
</ul>

<p>In “the dangers” section below I summarize some evidence on how parental age interacts with various risks. What’s not obvious is the relationship between the different risks. That is, are they mostly independent of each other, or is a child born with e.g., a heart defect much more likely to be autistic? They are not independent. For example, <a href="https://www.nature.com/articles/pr2006181">Eide et al.</a> find a significant correlation between birth defects and intellectual disability. So if you want to know “what are the odds my kid comes out totally healthy”, I think just looking at the highest risk and ignoring the others is reasonable.</p>

<p>If you’re interested in the details supporting the above conclusions, read on.</p>

<h2 id="technical-jargon">Technical jargon</h2>

<p>Skip this if you know these terms.</p>

<h3 id="prevalence">Prevalence</h3>

<p>The prevalence is what fraction of the population has the outcome of interest. Basically:</p>

<p>(number of people with the outcome) / (number of people that were studied).</p>

<h3 id="odds-ratio">Odds ratio</h3>

<p>An odds ratio is the ratio of the odds of the outcome of interest in the condition of interest to the odds in some reference condition. For the data below, the condition is always a particular parental age range, and the reference condition is some other age range that the researchers chose. For example, say we set the reference age to 25, and our outcome of interest is being born with green hair. If a study finds that children of fathers aged 30 have 1/10 odds of being born with green hair, and the children of fathers aged 25 have 1/100 odds of being born with green hair, then the odds ratio for age 30 is (1/10) / (1/100) = 10.</p>
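The green-hair example can be checked in one line; `odds_ratio` is just an illustrative helper name:

```python
def odds_ratio(odds_condition, odds_reference):
    """Ratio of the odds of the outcome under the condition of
    interest to the odds under the reference condition."""
    return odds_condition / odds_reference

# Fathers aged 30: 1/10 odds of green hair; the age-25 reference: 1/100 odds.
print(round(odds_ratio(1 / 10, 1 / 100)))  # 10
```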

<h3 id="95-ci">95% CI</h3>

<p>A 95% CI (confidence interval) is a range of values. Under certain assumptions<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> there is a 95% chance that the true value falls within that range.</p>
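For intuition, here is one common way a 95% CI for a prevalence estimate can be computed: a normal-approximation (Wald) interval. This particular formula and the counts below are my illustration; the studies cited in this post each use their own methods:

```python
import math

def prevalence_ci_95(cases, n):
    """Normal-approximation (Wald) 95% confidence interval for a
    prevalence estimated as cases / n. Illustrative only; real
    studies may use other interval constructions."""
    p = cases / n
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return (p - half_width, p + half_width)

# Hypothetical counts: 480 cases observed among 100,000 people studied.
low, high = prevalence_ci_95(480, 100_000)
print(f"{low:.3%} - {high:.3%}")
```

The interval narrows as the number of people studied grows, which is why the large-cohort studies below can report fairly tight ranges for rare outcomes.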

<h2 id="pre-implantation-genetic-testing">Pre-implantation genetic testing</h2>

<p>Pre-implantation genetic testing is done on embryos that have been fertilized in-vitro before implanting them into a woman (more details <a href="https://www.lesswrong.com/posts/yT22RcWrxZcXyGjsA/how-to-have-polygenically-screened-children#But_how_do_they_even_get_an_embryo_s_DNA_">here</a>). After developing for about 10 days, embryos have enough cells that some can be removed for genetic testing. <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9674466/">Sordia-Hernandez et al.</a> look at the effects of testing for aneuploidy, which is a specific kind of genetic defect that often results in miscarriage or abortion. They find significant benefits for women &gt;= 35 years old, but not for younger women.</p>

<p>Note that for almost everything else, the outcome being tested for is bad, whereas here it is live birth rate, or the odds of a child being born alive after an embryo is transferred into a woman.</p>

<table border="1">
  <tr>
   <td><strong>Mother’s age</strong>
   </td>
   <td><strong>Live birth rate odds ratio 95% CI (relative to no genetic testing)</strong>
   </td>
  </tr>
  <tr>
   <td>&lt; 35
   </td>
   <td>0.56, 1.34
   </td>
  </tr>
  <tr>
   <td>&gt;= 35
   </td>
   <td>1.07, 2.84
   </td>
  </tr>
</table>

<p>Very recently, some companies have started offering more in-depth genetic screening for embryos, such as assessing risk for polygenic traits, i.e., traits influenced by many genes. The companies offering this service claim all sorts of benefits, such as reducing the risk of cancer and diabetes, but I don’t think it’s been independently evaluated, and it’s probably too new to truly evaluate, since there’s a very small number of people alive who were screened in this way. <a href="https://www.lesswrong.com/posts/yT22RcWrxZcXyGjsA/how-to-have-polygenically-screened-children">Here’s Gene Smith’s post</a> that’s very enthusiastic about such screening and tells you how to go about it, and <a href="https://www.lesswrong.com/posts/yT22RcWrxZcXyGjsA/how-to-have-polygenically-screened-children?commentId=uiFXXRpdXCzXjmfj8">my response trying to summarize a skeptical position</a>.</p>

<p>So if you’re older and you don’t have frozen gametes, should you do IVF just so you can do pre-implantation genetic testing?</p>

<p>Pros:</p>

<ul>
  <li>Very effective at detecting aneuploidy, and thus increasing live birth rate per pregnancy.</li>
  <li>You can choose the child’s sex.</li>
  <li>If you opt for polygenic screening, it is possible to reduce other health risks and possibly improve other desirable traits like IQ. Again, see <a href="https://www.lesswrong.com/posts/yT22RcWrxZcXyGjsA/how-to-have-polygenically-screened-children">Gene Smith’s post</a> for more details on this.</li>
  <li>I haven’t seen any strong evidence that IVF results in worse health outcomes. Note there are many studies that show worse outcomes for IVF, but IVF is largely used by people who have fertility problems, and the differences seem to disappear entirely when controlling for this. <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3650450/">More here</a>.</li>
</ul>

<p>Cons:</p>

<ul>
  <li>Expensive in time and money (maybe $25,000 in 2023).</li>
  <li>Will be more expensive and / or less effective for women who produce fewer eggs per retrieval, which is mostly older women.</li>
  <li>There is <a href="https://academic.oup.com/humupd/article/25/2/137/5316072?login=false">some evidence</a> that IVF results in differences in the embryo that might possibly result in less healthy people (vs old-fashioned conception). I think the odds that this results in worse outcomes are quite low, but it’s worth mentioning.</li>
</ul>

<h2 id="the-dangers">The dangers</h2>

<h3 id="miscarriage">Miscarriage</h3>

<p>This chart from <a href="https://www.bmj.com/content/364/bmj.l869">Magnus et al.</a> shows the absolute risk by mother’s age. Y-axis is the proportion of pregnancies that end in miscarriage:</p>

<p><img class="wrap" src="/generated/2023-11-10-the-dangers-of-reproducing-while-old/maternal-miscarriage-780-b27867d53.jpg" alt="absolute risk of miscarriage by maternal age" srcset="/generated/2023-11-10-the-dangers-of-reproducing-while-old/maternal-miscarriage-400-902027773.webp 400w, /generated/2023-11-10-the-dangers-of-reproducing-while-old/maternal-miscarriage-600-902027773.webp 600w, /generated/2023-11-10-the-dangers-of-reproducing-while-old/maternal-miscarriage-780-902027773.webp 780w" /></p>

<p>And here’s some data from <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7456349/">du Fossé et al.</a> on the risk by father’s age. For the absolute risk, I assumed the absolute risk for the reference age is 10%, which seems to be about the value for a 27 year old woman from the chart above.</p>
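The absolute-risk column in the table below is just the assumed 10% reference risk scaled by the odds-ratio CI bounds (treating the odds ratio as roughly a relative risk, a reasonable approximation for an outcome this rare). A sketch:

```python
reference_risk = 0.10  # assumed absolute miscarriage risk at the reference age

# Odds-ratio 95% CI bounds by father's age, from du Fossé et al.
or_ci = {
    "30-34": (0.90, 1.21),
    "35-39": (0.92, 1.43),
    "40-44": (1.06, 1.43),
    ">= 45": (1.13, 1.81),
}

for age, (low, high) in or_ci.items():
    print(f"{age}: {reference_risk * low:.1%} - {reference_risk * high:.1%}")
```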

<table border="1">
  <tr>
   <td><strong>Father’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
   <td><strong>Absolute risk 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>25-29
   </td>
   <td>reference
   </td>
   <td>10%
   </td>
  </tr>
  <tr>
   <td>30-34
   </td>
   <td>0.9, 1.21
   </td>
   <td>9%, 12.1%
   </td>
  </tr>
  <tr>
   <td>35-39
   </td>
   <td>0.92, 1.43
   </td>
   <td>9.2%, 14.3%
   </td>
  </tr>
  <tr>
   <td>40-44
   </td>
   <td>1.06, 1.43
   </td>
   <td>10.6%, 14.3%
   </td>
  </tr>
  <tr>
   <td>&gt;= 45
   </td>
   <td>1.13, 1.81
   </td>
   <td>11.3%, 18.1%
   </td>
  </tr>
</table>

<h3 id="autism">Autism</h3>

<h4 id="prevalence-1">Prevalence</h4>

<p>People with a huge range of abilities and tendencies are all diagnosed with autism, and there’s a lot of debate about the accuracy of many diagnoses. However “profound autism” is a diagnosis with much clearer criteria. <a href="https://www.researchgate.net/publication/370128310_The_Prevalence_and_Characteristics_of_Children_With_Profound_Autism_15_Sites_United_States_2000-2016">Hughes et al.</a> defined profound autism as “children with autism who were either nonverbal or minimally verbal or had an IQ (intelligence quotient) &lt;50”. That study estimated the prevalence of profound autism in the USA as:</p>

<table border="1">
  <tr>
   <td>Female
   </td>
   <td>1.88 / 1000 = 1 / 532
   </td>
  </tr>
  <tr>
   <td>Male
   </td>
   <td>7.18 / 1000 = 1 / 139
   </td>
  </tr>
  <tr>
   <td>Overall
   </td>
   <td>4.59 / 1000 = 1 / 218
   </td>
  </tr>
</table>

<p>These numbers seem shockingly high, but they do somewhat match my casual observations. I don’t know a lot of children, but I know of at least 2 profoundly autistic boys.</p>

<h4 id="risk-by-parental-age">Risk by parental age</h4>

<p>The studies I found on the impact of parental age did not restrict themselves to just profound autism, so it’s possible that parental age interacts with profound autism differently, but my guess is it’s at least qualitatively correct.</p>

<p>Here are the results from <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2638544/">Durkin et al.</a>, who looked at both the father’s and the mother’s ages. For the absolute risk, I divided the number of autism spectrum disorder cases by the size of the “Birth Cohort Comparison Group” for the father’s or mother’s reference age, extracted from table 3 of the paper. For fathers that’s 322 / 67,080 = 0.48%; for mothers that’s 366 / 75,053 = 0.49%. These numbers are close to the overall risk of profound autism from Hughes et al. above, but this study considered any autism diagnosis, so something is probably wrong either with my calculation or with one or both of these studies.</p>
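<p>The arithmetic behind those reference risks and the absolute-risk columns below, as a quick sketch (counts are the ones quoted above from table 3 of Durkin et al.; scaling OR bounds by the reference risk is the same approximation as before):</p>

```python
# Reference-age absolute risks: ASD cases / birth cohort comparison group size.
father_ref_risk = 322 / 67_080   # fathers aged 25-29
mother_ref_risk = 366 / 75_053   # mothers aged 25-29
print(f"fathers: {father_ref_risk:.2%}, mothers: {mother_ref_risk:.2%}")
# fathers: 0.48%, mothers: 0.49%

# The absolute-risk columns scale the reference risk by the OR bounds,
# e.g. fathers >= 40 with an OR CI of 1.1-1.8:
print(f"{1.1 * father_ref_risk:.2%}, {1.8 * father_ref_risk:.2%}")
# 0.53%, 0.86%
```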

<table border="1">
  <tr>
   <td><strong>Father’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
   <td><strong>Absolute risk 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>&lt;20
   </td>
   <td>0.4, 1.0
   </td>
   <td>0.19%, 0.48%
   </td>
  </tr>
  <tr>
   <td>20-24
   </td>
   <td>0.7, 1.1
   </td>
   <td>0.34%, 0.53%
   </td>
  </tr>
  <tr>
   <td>25-29
   </td>
   <td>Reference
   </td>
   <td>0.48%
   </td>
  </tr>
  <tr>
   <td>30-34
   </td>
   <td>0.9, 1.2
   </td>
   <td>0.43%, 0.58%
   </td>
  </tr>
  <tr>
   <td>35-39
   </td>
   <td>0.9, 1.3
   </td>
   <td>0.43%, 0.62%
   </td>
  </tr>
  <tr>
   <td>&gt;= 40
   </td>
   <td>1.1, 1.8
   </td>
   <td>0.53%, 0.86%
   </td>
  </tr>
</table>

<table border="1">
  <tr>
   <td><strong>Mother’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
   <td><strong>Absolute risk 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>&lt;20
   </td>
   <td>0.5, 1.0
   </td>
   <td>0.25%, 0.49%
   </td>
  </tr>
  <tr>
   <td>20-24
   </td>
   <td>0.8, 1.1
   </td>
   <td>0.39%, 0.54%
   </td>
  </tr>
  <tr>
   <td>25-29
   </td>
   <td>Reference
   </td>
   <td>0.49%
   </td>
  </tr>
  <tr>
   <td>30-34
   </td>
   <td>0.9, 1.3
   </td>
   <td>0.44%, 0.64%
   </td>
  </tr>
  <tr>
   <td>&gt;= 35
   </td>
   <td>1.1, 1.6
   </td>
   <td>0.54%, 0.64%
   </td>
  </tr>
</table>

<p>Note from paper: “Because the increased risk was similar for ages 35–39 and ≥40 years, the high-risk maternal age category was defined as ≥35 years.”</p>

<p>And here are results from <a href="https://doi.org/10.1038/mp.2010.121">another study</a> that looked only at the father’s age:</p>

<table border="1">
  <tr>
   <td><strong>Father’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>15-29
   </td>
   <td>Reference
   </td>
  </tr>
  <tr>
   <td>30-39
   </td>
   <td>1.0, 1.42
   </td>
  </tr>
  <tr>
   <td>40-49
   </td>
   <td>1.07, 1.87
   </td>
  </tr>
  <tr>
   <td>&gt;= 50
   </td>
   <td>1.26, 3.88
   </td>
  </tr>
</table>

<p>This chart shows absolute risks from that same study:</p>

<p><img class="wrap" src="/generated/2023-11-10-the-dangers-of-reproducing-while-old/paternal-autism-571-ad78da241.png" alt="absolute risk of autism by paternal age" srcset="/generated/2023-11-10-the-dangers-of-reproducing-while-old/paternal-autism-400-5c21b43d3.webp 400w, /generated/2023-11-10-the-dangers-of-reproducing-while-old/paternal-autism-571-5c21b43d3.webp 571w" /></p>

<h3 id="chromosome-disorders">Chromosome disorders</h3>

<h4 id="prevalence-2">Prevalence</h4>

<p><a href="https://pubmed.ncbi.nlm.nih.gov/9934980/">Caron, Tihy, and Dallaire</a> find that among mothers aged &gt;= 35, 1.79% (about 1 / 55) of pregnancies have a chromosomal disorder in the second trimester. Note that some chromosome disorders result in miscarriage earlier than that, so the true prevalence is certainly higher.</p>

<h4 id="risk-by-parental-age-1">Risk by parental age</h4>

<p>To compute absolute risk, I took the prevalence number from above and divided it by 5.66 (the midpoint of the odds ratio CI for mothers aged &gt;= 35) to get 1.79% / 5.66 = 0.32%.</p>
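<p>That back-calculation in one step (prevalence from Caron, Tihy, and Dallaire; the OR CI of 5.13–6.2 for mothers &gt;= 35 from Ahn et al., with its midpoint again treated as an approximate relative risk):</p>

```python
prevalence_35_plus = 0.0179        # 1.79% second-trimester prevalence, mothers >= 35
or_midpoint = (5.13 + 6.2) / 2     # midpoint of the OR CI, ~5.66
reference_risk = prevalence_35_plus / or_midpoint
print(f"{reference_risk:.2%}")     # 0.32%
```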

<table border="1">
  <tr>
   <td><strong>Father’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
   <td><strong>Absolute risk 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>&lt;20
   </td>
   <td>1.01, 1.89
   </td>
   <td>0.32%, 0.60%
   </td>
  </tr>
  <tr>
   <td>25-29
   </td>
   <td>reference
   </td>
   <td>0.32%
   </td>
  </tr>
  <tr>
   <td>&gt;= 40
   </td>
   <td>1.12, 1.52
   </td>
   <td>0.36%, 0.49%
   </td>
  </tr>
</table>

<p>From <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7803514/">Fang et al.</a></p>

<table border="1">
  <tr>
   <td><strong>Mother’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
   <td><strong>Absolute risk 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>&lt;20
   </td>
   <td>0.54, 0.88
   </td>
   <td>0.17%, 0.28%
   </td>
  </tr>
  <tr>
   <td>20-34
   </td>
   <td>reference
   </td>
   <td>0.32%
   </td>
  </tr>
  <tr>
   <td>&gt;= 35
   </td>
   <td>5.13, 6.2
   </td>
   <td>1.64%, 1.98%
   </td>
  </tr>
</table>

<p>From <a href="https://obgyn.onlinelibrary.wiley.com/doi/10.1111/aogs.14339">Ahn et al.</a></p>

<h3 id="urogenital-defects">Urogenital defects</h3>

<h4 id="prevalence-3">Prevalence</h4>

<p>1.60 / 1000 = 1 / 625. <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6472003/">Source</a>.</p>

<h4 id="risk-by-parental-age-2">Risk by parental age</h4>

<p>I didn’t find an easy way to calculate absolute risk.</p>

<table border="1">
  <tr>
   <td><strong>Father’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>&lt;20
   </td>
   <td>1.03, 2.19
   </td>
  </tr>
  <tr>
   <td>25-29
   </td>
   <td>reference
   </td>
  </tr>
  <tr>
   <td>&gt;= 40
   </td>
   <td>1.07, 1.52
   </td>
  </tr>
</table>

<p>From <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7803514/">Fang et al.</a></p>

<table border="1">
  <tr>
   <td><strong>Mother’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>20-34
   </td>
   <td>reference
   </td>
  </tr>
  <tr>
   <td>&gt;= 35
   </td>
   <td>1.13, 1.89
   </td>
  </tr>
</table>

<p>From <a href="https://obgyn.onlinelibrary.wiley.com/doi/10.1111/aogs.14339">Ahn et al.</a></p>

<h3 id="heart-defects">Heart defects</h3>

<h4 id="prevalence-4">Prevalence</h4>

<p>137.1 / 10,000 = 1 / 73. <a href="https://www.sciencedirect.com/science/article/pii/S0002870314004980">Source</a>.</p>

<p>Note: this seems really high to me. Maybe most of these are not very serious, or maybe I know people who were born with heart defects but I don’t know they have them.</p>

<h4 id="risk-by-parental-age-3">Risk by parental age</h4>

<p>I didn’t find an easy way to calculate absolute risk.</p>

<table border="1">
  <tr>
   <td><strong>Father’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>&lt;20
   </td>
   <td>0.96, 1.16
   </td>
  </tr>
  <tr>
   <td>25-29
   </td>
   <td>reference
   </td>
  </tr>
  <tr>
   <td>&gt;= 40
   </td>
   <td>1.01, 1.2
   </td>
  </tr>
</table>

<p>From <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7803514/">Fang et al.</a></p>

<table border="1">
  <tr>
   <td><strong>Mother’s age</strong>
   </td>
   <td><strong>Odds ratio 95% CI</strong>
   </td>
  </tr>
  <tr>
   <td>&lt;20
   </td>
   <td>0.79, 1.1
   </td>
  </tr>
  <tr>
   <td>20-34
   </td>
   <td>reference
   </td>
  </tr>
  <tr>
   <td>&gt;= 35
   </td>
   <td>1.06, 1.24
   </td>
  </tr>
</table>

<p>From <a href="https://obgyn.onlinelibrary.wiley.com/doi/10.1111/aogs.14339">Ahn et al.</a></p>

<!-- Footnotes themselves at the bottom. -->
<h2 id="notes">Notes</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1">

      <p>Which scientists sometimes violate and thus invalidate their own results, but for now I’m just assuming these stats are sound. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>garymm</name></author><category term="parenting" /><category term="biology" /><category term="health" /><summary type="html"><![CDATA[I had my first child when I was 36 years old, which made me want to understand the risks of having children at different ages. Before looking into this, my impression was that the main biological problems with old-age parenthood had to do with not having the necessary health and vigor to care for young’uns, and I had heard that older women have trouble getting pregnant. While those are real issues, there are many others worthy of consideration.]]></summary></entry></feed>