Merged
3 changes: 3 additions & 0 deletions Makefile
@@ -1,6 +1,9 @@
clean:
- rm -rf book/_build

spellcheck:
uv run codespell */*.md

build-html: clean
myst build --html
npx serve _build/html
1,704 changes: 425 additions & 1,279 deletions book/AI_coding_assistants.md

Binary file added book/images/metr_horizon_benchmark.png
Binary file added book/images/opus_effort_comparison.png
Binary file added book/images/stackoverflow_trend.png
Binary file added book/images/wei-COT.png
45 changes: 43 additions & 2 deletions book/references.bib
@@ -1,13 +1,54 @@
%% This BibTeX bibliography file was created using BibDesk.
%% https://bibdesk.sourceforge.io/

%% Created for Russell Poldrack at 2026-02-12 16:32:43 -0800
%% Created for Russell Poldrack at 2026-02-16 11:18:33 -0800


%% Saved with string encoding Unicode (UTF-8)



@book{Ousterhout:2021aa,
author = {John Ousterhout},
date-added = {2026-02-16 11:17:41 -0800},
date-modified = {2026-02-16 11:18:32 -0800},
edition = {2nd Edition},
publisher = {Yaknyam Press},
title = {A Philosophy of Software Design},
year = {2021}}

@misc{Bridgeford:2025aa,
archiveprefix = {arXiv},
author = {Eric W. Bridgeford and Iain Campbell and Zijao Chen and Zhicheng Lin and Harrison Ritz and Joachim Vandekerckhove and Russell A. Poldrack},
date-added = {2026-02-16 10:25:22 -0800},
date-modified = {2026-02-16 10:25:24 -0800},
eprint = {2510.22254},
primaryclass = {cs.SE},
title = {Ten Simple Rules for AI-Assisted Coding in Science},
url = {https://arxiv.org/abs/2510.22254},
year = {2025},
bdsk-url-1 = {https://arxiv.org/abs/2510.22254}}

@misc{Wei:2023aa,
archiveprefix = {arXiv},
author = {Jason Wei and Xuezhi Wang and Dale Schuurmans and Maarten Bosma and Brian Ichter and Fei Xia and Ed Chi and Quoc Le and Denny Zhou},
date-added = {2026-02-14 12:12:15 -0800},
date-modified = {2026-02-14 12:12:17 -0800},
eprint = {2201.11903},
primaryclass = {cs.CL},
title = {Chain-of-Thought Prompting Elicits Reasoning in Large Language Models},
url = {https://arxiv.org/abs/2201.11903},
year = {2023},
bdsk-url-1 = {https://arxiv.org/abs/2201.11903}}

@article{METR:2025aa,
author = {METR},
date-added = {2026-02-13 06:43:42 -0800},
date-modified = {2026-02-13 06:43:44 -0800},
journal = {arXiv preprint arXiv:2503.14499},
title = {Measuring AI Ability to Complete Long Tasks},
year = {2025}}

@misc{McDuff:2024aa,
archiveprefix = {arXiv},
author = {Daniel McDuff and Tim Korjakow and Scott Cambo and Jesse Josua Benjamin and Jenny Lee and Yacine Jernite and Carlos Mu{\~n}oz Ferrandis and Aaron Gokaslan and Alek Tarkowski and Joseph Lindley and A. Feder Cooper and Danish Contractor},
@@ -120,7 +161,7 @@ @misc{Gruenpeter:2024aa
bdsk-url-1 = {https://doi.org/10.5281/zenodo.10786147}}

@article{Smith:2016aa,
author = { AM Smith and DS Katz and KE Niemeyer and FORCE11 Software Citation Working Group},
author = {AM Smith and DS Katz and KE Niemeyer and FORCE11 Software Citation Working Group},
date-added = {2026-02-11 10:20:47 -0800},
date-modified = {2026-02-11 10:22:50 -0800},
journal = {PeerJ Computer Science},
13 changes: 2 additions & 11 deletions book/workflows.md
@@ -980,6 +982,8 @@ This dataset should be saved to tests/data/testdata.h5ad.

Claude Code took about 20 minutes to generate an entire test framework for the code, comprising 215 test functions and 19 test fixtures. Interestingly, Claude disregarded my instructions to use functions rather than classes for tests, generating 78 test classes. While I usually prefer tests to be in pure functions rather than classes so that novices can more easily understand them, I decided in this case to stay with the class-based implementation since I don't mind it and it does make the organization of the tests a bit cleaner.
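For readers unfamiliar with the distinction, here is a minimal sketch of the two pytest styles; `normalize` is a stand-in function for illustration, not part of the actual project:

```python
# Hypothetical example contrasting the two pytest styles.
def normalize(values):
    """Stand-in function under test: scale values so they sum to one."""
    total = sum(values)
    return [v / total for v in values]

# Function-based style: each test stands alone, which novices find easier to read.
def test_normalize_sums_to_one():
    result = normalize([1.0, 2.0, 3.0])
    assert abs(sum(result) - 1.0) < 1e-9

# Class-based style: related tests are grouped together, and fixtures
# can be scoped to the class, which keeps the organization cleaner.
class TestNormalize:
    def test_sums_to_one(self):
        result = normalize([2.0, 2.0])
        assert result == [0.5, 0.5]
```

Both styles are discovered and run identically by pytest; the difference is purely organizational.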

The initial test set for this project had no tests at all for one module and left significant portions of other modules untested. I was able to improve this by having Claude Code analyze the code coverage report and identify important parts of the code that were not yet covered, which raised test coverage from 69% to 88% of the 870 statements identified by the `coverage` tool.
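For reference, the basic workflow that produces such a coverage report looks like this (a sketch using coverage.py's standard commands; the exact invocation for this project may differ):

```shell
# Run the test suite under coverage.py, recording which statements execute
coverage run -m pytest

# Per-module summary, with line numbers of statements that were never executed
coverage report -m

# Optional: browsable HTML report written to htmlcov/
coverage html
```

The `-m` flag on `coverage report` lists the missing lines, which is what an agent needs in order to target the uncovered parts of the code.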

#### Avoiding the happy path

Because it is essential for AI-generated tests to be assessed by a knowledgeable human, I proceeded to read all of the tests that had been generated by Claude. Fortunately they were all easily readable and clearly named, which made it relatively easy to see some potential problems right away. Several kinds of issues arose.
@@ -1033,19 +1035,8 @@ In other cases, the tests that were generated were too minimal, allowing obvious

Pseudobulking is an operation that should summarize all cells of a given type for each donor, but none of the test conditions actually check that it has been properly applied. In fact, these tests could pass if `run_pseudobulk_pipeline()` simply passed the original data back without doing anything to it! This is a case where domain knowledge is essential to get the tests right and avoid the happy path. In several other cases the tests called `pytest.skip()` (which causes the test to be skipped) for outcomes that really should have triggered a test failure. For example, it skipped the integration tests for the full dataset if the dataset hadn't already been created, and it also skipped the *Snakemake* integration functions if the *Snakemake* call failed (which it initially did because of a missing argument).
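To make this failure mode concrete, here is a hedged sketch using a stand-in `pseudobulk` function (not the project's actual `run_pseudobulk_pipeline()`), contrasting a happy-path test with one that actually verifies the aggregation:

```python
def pseudobulk(matrix, cell_types):
    """Stand-in implementation: average expression across cells of the same type."""
    groups = {}
    for row, cell_type in zip(matrix, cell_types):
        groups.setdefault(cell_type, []).append(row)
    return {
        cell_type: [sum(col) / len(rows) for col in zip(*rows)]
        for cell_type, rows in groups.items()
    }

# Weak test: would pass even if pseudobulk() returned the input unchanged.
def test_pseudobulk_runs():
    result = pseudobulk([[1, 2], [3, 4]], ["B", "B"])
    assert result is not None

# Stronger test: checks that cells of the same type were actually aggregated.
def test_pseudobulk_aggregates():
    result = pseudobulk([[1, 2], [3, 4]], ["B", "B"])
    assert list(result) == ["B"]        # one row per cell type, not per cell
    assert result["B"] == [2.0, 3.0]    # per-gene mean across the two B cells
```

The stronger test encodes the domain knowledge that pseudobulking must reduce many cells to one summary per donor and cell type; the weak test encodes nothing.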

#### Lessons learned about reviewing AI-generated tests

These examples highlight the need to closely examine the test code that is generated by AI agents. However, it's worth noting that although reading over the AI-generated tests took a significant amount of human time, it was still far less than writing the test code without AI assistance would have required, and Claude was able to fix all of the issues to my satisfaction once I raised them.

My examination of the AI-generated code highlighted a number of failure points that one should look for when reviewing AI-generated test code:

- *Weak assertions*: In many cases there were assertions present that would have passed even if the function did not give an appropriate result. They were in effect functioning more like *smoke tests* (i.e. testing whether the function runs without crashing) rather than unit tests that are meant to test whether the function returns the proper kinds of outputs. It's important to understand what the function's intended output is, and make sure that the actual output matches that intention.
- *Testing for changes*: When data go into a function, we generally expect the output to be changed in some way. It's important to test specifically whether the intended changes were made.
- *Numerical precision*: One of these tests initially failed because it compared two very large numbers for exact equality (93483552 vs 93483547), which differed due to floating point error. Floating point equality should be tested using a method that allows some degree of tolerance (e.g. `pytest.approx()`), though the tolerance can be tricky to calibrate so that it catches real errors while avoiding spurious failures.
- *Coverage gaps*: The initial test set for this project had no tests at all for one module and left significant portions of other modules untested. I was able to improve this by having Claude Code analyze the code coverage report and identify important parts of the code that were not yet covered, which raised test coverage from 69% to 88% of the 870 statements identified by the `coverage` tool.
- *Checking for non-critical dependencies*: In many cases the code will simply crash when a dependency is missing, but in some cases (as in the harmony example above), the test may modify its behavior depending on the presence or absence of a particular dependency. If the use of a particular dependency is critical to the workflow (as it was for this one) then it's important to check for those dependencies and make sure that they work properly.
- *Outdated APIs*: In a couple of cases, the initial tests called external package functions that are now deprecated. This is hard to avoid given that the knowledge base of LLMs often lags many months behind current software, but adding `-W error::FutureWarning` to one's `pytest` commands can help identify features that are currently allowed but will be deprecated in the future. Such warnings may also arise within an external package itself, in which case one may simply need to ignore them (using `@pytest.mark.filterwarnings`), since they can't be fixed by the user.
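As a small illustration of the numerical-precision point, using the two values from the example above (`math.isclose` shown here is the standard-library analogue of `pytest.approx`):

```python
import math

total_expected = 93483552.0
total_actual = 93483547.0  # differs only by accumulated floating-point error

# Exact equality fails on the last digits:
assert total_actual != total_expected

# A relative tolerance accepts the tiny discrepancy while still rejecting
# genuinely wrong results; calibrating it is the tricky part.
assert math.isclose(total_actual, total_expected, rel_tol=1e-6)
assert not math.isclose(total_actual, 2 * total_expected, rel_tol=1e-6)
```

In a pytest suite, the equivalent assertion would be `assert total_actual == pytest.approx(total_expected, rel=1e-6)`.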

#### Property-based testing for workflows

The tests initially developed for this workflow were built around the known characteristics of the expected data. However, there are many "unknown unknowns" when it comes to input data, and it's important to make sure that the code deals gracefully with problematic inputs. We can test this using a *property-based testing* approach; as I discussed in Chapter 4, this involves generating many varied datasets and checking whether the code handles them appropriately. When I asked the coding agent to identify plausible candidates for property-based testing using the Hypothesis package, it generated [tests](https://github.com/BetterCodeBetterScience/example-rnaseq/blob/main/tests/test_hypothesis.py) centered on several different properties:
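As an illustrative sketch only (the function and property below are hypothetical, not taken from the linked test file), a Hypothesis property test for a normalization step might look like:

```python
from hypothesis import given, strategies as st

def normalize_counts(counts):
    """Stand-in pipeline step: scale nonnegative counts to proportions."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]

# Property: for any nonempty list of nonnegative integers, the output contains
# no negative values and sums to ~1 (or is all zeros for an all-zero input).
@given(st.lists(st.integers(min_value=0, max_value=10**6), min_size=1, max_size=50))
def test_normalize_counts_properties(counts):
    result = normalize_counts(counts)
    assert all(v >= 0 for v in result)
    assert sum(result) == 0 or abs(sum(result) - 1.0) < 1e-9
```

Running this under pytest exercises the function against many generated inputs, including edge cases (such as all-zero counts) that example-based tests often miss.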