Stop Maintaining Tests by Hand: Causify's Open-Source pytest Framework
Three things that turn test maintenance from a growing chore into a non-event: golden files for expected output, automatic reproducibility across machines, and speed-enforced test tiers.
Test suites should get easier to maintain as a codebase grows, not harder. In practice, the opposite happens: inline expected strings become walls of text, flaky tests appear on machines other than yours, and the "fast" suite slowly fills up with tests that take 30 seconds each.
At Causify, we hit all three of these problems early. Our codebase is heavy with data transformations and pipeline outputs, so tests that check formatted DataFrames or multi-step reports are common. Managing them with standard pytest patterns did not scale.
We built a test base class to fix this, released it under the Apache 2.0 license
in our
helpers repository,
and have been running it in production for years. This post covers the three
things that make it different: golden file testing, reproducibility by
default, and speed-tiered test classification.
Golden File Testing
The most common test maintenance burden in data-heavy codebases is updating expected strings. A standard pytest test for a function that produces a formatted report looks like this:
def test_report(self) -> None:
    result = build_report(data)
    expected = """
    # shape=(100, 5)
    # columns=open,high,low,close,volume
    # index=[2024-01-01, 2024-03-31]
    min_price=42.1, max_price=98.7
    """
    self.assertEqual(result, expected)
When the format changes (a new column, a different precision, a renamed field), the test breaks. You run pytest, read the diff in terminal output, copy the new value, paste it into the source file, repeat for every broken test.
For one-line outputs this is tolerable. For a 50-line formatted table or a deeply nested config, it is a time sink. And the PRs that result are hard to review: the meaningful output change is buried inside edited Python string literals that reviewers have to trust are correct.
The Fix: Expected Output Lives on Disk
Our solution is to stop putting expected values inside test code. Instead, the expected output for each test lives in its own text file on disk, called a "golden file". The test code itself stays minimal:
import helpers.hunit_test as hunitest

class TestBuildReport1(hunitest.TestCase):
    def test1(self) -> None:
        # Prepare inputs.
        data = load_test_data()
        # Run test.
        result = build_report(data)
        # Check outputs.
        self.check_string(result)
The first time you run this with --update_outcomes, the framework writes the
actual output to a file next to the test file:
test/outcomes/TestBuildReport1.test1/output/test.txt
Every subsequent run without that flag compares actual output against the stored file and fails with a diff if they differ.
When output changes intentionally, you run one command:
pytest --update_outcomes
The framework overwrites all affected golden files and stages them with
git add. Your PR now contains a clean .txt file change instead of edited
Python strings. Reviewers can read it directly. A wrong update is obvious at a
glance.
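For readers who want to see the mechanics, the compare-or-update cycle fits in a few lines. This is an illustrative sketch, not the framework's real implementation; the test_id and base_dir parameters and the error messages are invented for the example:

```python
import difflib
import os

def check_string(test_id: str, actual: str, update: bool = False, base_dir: str = ".") -> None:
    """Compare `actual` against the golden file for `test_id`.

    An illustrative sketch of the golden-file workflow, not the
    framework's implementation; `test_id` and `base_dir` are invented
    parameters for this example.
    """
    golden_path = os.path.join(base_dir, "test", "outcomes", test_id, "output", "test.txt")
    if update:
        # --update_outcomes: store the actual output as the new golden file.
        os.makedirs(os.path.dirname(golden_path), exist_ok=True)
        with open(golden_path, "w") as f:
            f.write(actual)
        return
    if not os.path.exists(golden_path):
        raise AssertionError(
            f"No golden file {golden_path}; run pytest with --update_outcomes to create it"
        )
    with open(golden_path) as f:
        expected = f.read()
    if actual != expected:
        # Fail with a readable diff.
        diff = "\n".join(
            difflib.unified_diff(
                expected.splitlines(), actual.splitlines(),
                fromfile="golden", tofile="actual", lineterm="",
            )
        )
        raise AssertionError("Golden file mismatch:\n" + diff)
```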
When a test produces multiple string outputs, pass tag= to give each its own
golden file:
self.check_string(summary, tag="summary")
self.check_string(details, tag="details")
What the Golden File Looks Like
The file TestBuildReport1.test1/output/test.txt holds the expected output
exactly as the function produced it:
# shape=(100, 5)
# columns=open,high,low,close,volume
# index=[2024-01-01, 2024-03-31]
min_price=42.1, max_price=98.7
That is all. No quotes, no escape sequences, no surrounding Python syntax. When a reviewer opens a PR that changes this file, they read the output change directly, not a modified string literal buried inside a test function.
Predictable Directory Layout
Every test class gets three directories derived from its class and method name.
Static fixtures checked into git:
test/outcomes/TestBuildReport1.test1/input/
Golden files checked into git:
test/outcomes/TestBuildReport1.test1/output/
Ephemeral files deleted after the test:
test/scratch/TestBuildReport1.test1/
The input/ directory holds static fixtures the test reads. The output/
directory is where check_string() writes and compares golden files. The
scratch/ directory is for temporary artefacts produced during the test that
you do not want to commit.
Because the paths are derived from the class and method name, you can always find a test's data without reading the test code.
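The derivation itself is trivial, which is the point. A sketch of the mapping (the function name and return shape are invented for illustration):

```python
import pathlib

def get_test_dirs(class_name: str, method_name: str) -> dict:
    """Map a test class/method pair to its three directories.

    A sketch of the layout shown above; the function name and return
    type are invented for illustration.
    """
    key = f"{class_name}.{method_name}"
    return {
        # Static fixtures, committed to git.
        "input": pathlib.Path("test") / "outcomes" / key / "input",
        # Golden files, committed to git.
        "output": pathlib.Path("test") / "outcomes" / key / "output",
        # Ephemeral artefacts, deleted after the test.
        "scratch": pathlib.Path("test") / "scratch" / key,
    }
```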
When a test fails on mismatch, the framework prints a side-by-side diff and
outputs a ready-to-run vimdiff command so you can inspect the full difference
in one copy-paste. When the golden file is missing entirely, it tells you to run
with --update_outcomes to generate it.
When Golden Tests Help Most
Golden file testing pays off most when the output you are testing is large, structured, or changes as a unit rather than field by field.
Some cases where we reach for it immediately:
- Formatted DataFrames: a 50-row summary table with column widths, index labels, and float precision. An inline string is unreadable; a .txt file is just a table
- Nested config objects: hierarchical configuration printed for debugging or logging. When a new key is added, the golden diff shows exactly where it landed
- Pipeline stage reports: multi-step ETL or model-training output where each stage appends a section. One command updates the whole snapshot
- API or CLI output: responses from external tools called in integration tests, where the full response matters but is too large to inline
For simple two-value assertions (assertEqual(result, 42)) inline is still
fine. Golden files shine when the expected output would take up more space than
the test logic itself.
String Assertions with Automatic Diffing
There are three levels of string assertion in the framework:
- assertEqual(result, 42) — scalars and one-liners
- assert_equal(result, expected) — short strings you can write inline, with sdiff and a vimdiff script on failure
- check_string(result) — large outputs that belong in a golden file
assert_equal() uses the same diff mechanism as check_string() but does not
touch the filesystem. Pass fuzzy_match=True to ignore whitespace differences,
or purify_text=True to strip machine-specific paths before comparing.
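A whitespace-insensitive comparison of the kind fuzzy_match=True performs can be sketched like this (the exact normalization rules here are an assumption; the framework's may differ):

```python
import re

def fuzzy_normalize(txt: str) -> str:
    """Collapse runs of whitespace and drop blank lines.

    A sketch of what fuzzy_match=True might do before comparing; the
    framework's actual normalization rules may differ.
    """
    lines = [re.sub(r"\s+", " ", line).strip() for line in txt.split("\n")]
    return "\n".join(line for line in lines if line)

def fuzzy_assert_equal(actual: str, expected: str) -> None:
    # Compare the normalized forms, so formatting-only differences pass.
    assert fuzzy_normalize(actual) == fuzzy_normalize(expected)
```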
Testing DataFrames with Numerical Tolerance
For pandas DataFrames the framework provides check_dataframe(), which
serializes the DataFrame in CSV format as its golden file and compares
numerically rather than as raw text:
import helpers.hunit_test as hunitest

class TestPrices1(hunitest.TestCase):
    def test1(self) -> None:
        result = compute_prices(data)
        self.check_dataframe(result)
The golden file follows the same path pattern as check_string() and ends in
.txt (the default tag is test_df, so the file is test_df.txt). Run
pytest --update_outcomes once to generate it, then commit it alongside the
test — the same workflow as string golden files. On a mismatch the framework
prints the actual and expected DataFrames, a masked view showing
only the differing cells, and the relative error for each cell.
The err_threshold parameter (default 0.05) sets the relative tolerance
passed to numpy.allclose:
self.check_dataframe(result, err_threshold=0.01) # tight: 1% tolerance
self.check_dataframe(result, err_threshold=0.10) # loose: 10% tolerance
Use a tight threshold for exact financial figures and a looser one where
floating-point rounding across environments is expected. When multiple
DataFrames appear in the same test, pass a tag argument to give each its own
golden file:
self.check_dataframe(prices_df, tag="prices")
self.check_dataframe(returns_df, tag="returns")
When the expected DataFrame is small enough to construct inline, assert_dfs_close()
skips the golden file and compares directly:
class TestPrices1(hunitest.TestCase):
    def test1(self) -> None:
        result = compute_prices(data)
        expected = pd.DataFrame({"price": [42.1, 98.7]})
        self.assert_dfs_close(result, expected)
It checks index, columns, and values using numpy.allclose and accepts the
same keyword arguments as check_dataframe().
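Under the hood, this style of comparison reduces to index, column, and numpy.allclose checks. A hedged sketch (the function name is invented, and the real assert_dfs_close() also reports exactly which cells differ):

```python
import numpy as np
import pandas as pd

def dfs_close(actual: pd.DataFrame, expected: pd.DataFrame, err_threshold: float = 0.05) -> bool:
    """Compare two DataFrames with relative numerical tolerance.

    A sketch of the comparison described above; the function name is
    invented and the real assertion also pinpoints differing cells.
    """
    # Structure must match exactly.
    if not actual.index.equals(expected.index):
        return False
    if not actual.columns.equals(expected.columns):
        return False
    # Values are compared numerically, not as text.
    return bool(np.allclose(actual.to_numpy(), expected.to_numpy(), rtol=err_threshold))
```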
Reproducibility Without the Boilerplate
Golden files only stay stable if tests produce the same output on every machine. Floating-point formatting, pandas display options, and random state all vary across environments and library versions unless you explicitly control them.
Our base class handles this automatically. Before every test method runs, it
resets the random seed, restores pandas display options to known defaults, and
replaces matplotlib.pyplot.show with a function that does nothing.
None of this requires any code in your test class. Inherit from
hunitest.TestCase and reproducibility comes for free.
A test that passes on a developer's laptop passes the same way in CI and six months later when a new pandas version ships with different default display widths.
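To make that concrete, a before-each-test reset might look like the following. The seed value and the particular pandas options are assumptions for illustration, not the framework's actual defaults:

```python
import random

import numpy as np
import pandas as pd

def reset_test_environment(seed: int = 42) -> None:
    """Reset global state before each test.

    A sketch of the resets described above; the seed value and the
    specific pandas options are assumptions, not the framework's
    actual defaults.
    """
    # Deterministic random state.
    random.seed(seed)
    np.random.seed(seed)
    # Known pandas display defaults, so formatted output is stable.
    pd.reset_option("display.max_rows")
    pd.reset_option("display.max_columns")
    pd.set_option("display.width", 120)
    # Make plotting a no-op so tests never block on a window.
    try:
        import matplotlib.pyplot as plt
        plt.show = lambda *args, **kwargs: None
    except ImportError:
        pass
```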
If your test class needs custom setup, override set_up_test() and
tear_down_test() rather than setUp() / tearDown(). The base class runs
these through a pytest fixture that guarantees teardown executes even when a
test fails, which the standard setUp/tearDown pair does not guarantee in
all pytest versions.
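The yield-fixture pattern behind this guarantee looks roughly like this (a sketch; hunitest.TestCase wires it up for you):

```python
import pytest

class BaseTestCase:
    """Sketch of fixture-driven setup/teardown; illustrative only."""

    # Override these in your test class instead of setUp()/tearDown().
    def set_up_test(self) -> None:
        pass

    def tear_down_test(self) -> None:
        pass

    def _run_around_test(self):
        # Everything before the yield runs before each test method;
        # everything after it runs afterwards. pytest always finalizes
        # yield fixtures, so teardown executes even if the test fails.
        self.set_up_test()
        yield
        self.tear_down_test()

    @pytest.fixture(autouse=True)
    def _setup_teardown(self):
        yield from self._run_around_test()
```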
Handling Non-Deterministic Output
Some outputs contain values that legitimately differ between machines or runs:
absolute file paths, usernames in log lines, memory addresses in repr()
output. These would cause every golden file to mismatch on a different
developer's machine.
The framework handles this with a purify_text flag on check_string:
self.check_string(result, purify_text=True)
When purify_text=True, a TextPurifier runs over the output before comparison
and strips known sources of machine-specific noise: absolute paths, usernames,
and similar patterns. The golden file stores the cleaned version, so it matches
everywhere.
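A minimal purifier can be written with plain string replacement and a regex. The placeholder names and the explicit root_dir / user_name parameters are assumptions for this sketch; the framework's TextPurifier covers more patterns:

```python
import re

def purify_text(txt: str, root_dir: str, user_name: str) -> str:
    """Replace machine-specific values with stable placeholders.

    A sketch of what purify_text=True normalizes; the placeholder names
    and explicit parameters are invented for this example.
    """
    # Absolute paths under the repo root become a placeholder.
    txt = txt.replace(root_dir, "$GIT_ROOT")
    # The current user's name becomes a placeholder.
    txt = txt.replace(user_name, "$USER_NAME")
    # Memory addresses from repr() output are masked.
    txt = re.sub(r"0x[0-9a-fA-F]+", "0x...", txt)
    return txt
```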
Test-Mode Utilities
The base class exposes three small helpers that solve common annoyances when
writing tests. All three are importable from helpers.hunit_test:
import helpers.hunit_test as hunitest
in_unit_test_mode() returns True when the code is executing inside a
pytest run. Use it to gate expensive setup or behavior that only makes sense
during tests:
if hunitest.in_unit_test_mode():
    # Skip network calls during unit tests.
    return cached_fixture
pytest_print(txt) prints text that bypasses pytest's output capture.
Standard print() calls are swallowed unless you pass -s; this function
always writes directly to stdout regardless of capture mode:
hunitest.pytest_print("debug snapshot: " + str(intermediate_result))
pytest_warning(txt) does the same but prepends a yellow-colored WARNING:
prefix. The framework uses it internally when a golden file is created or
updated, so you always see an explicit notice in the test run output:
hunitest.pytest_warning("using fallback data source")
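If you want similar conveniences without the framework, both helpers are small. This sketch detects pytest via the PYTEST_CURRENT_TEST environment variable, which pytest sets for the duration of each test; the stdout bypass via sys.__stdout__ is an approximation, since it only sidesteps pytest's sys-level capture, not the default fd-level capture:

```python
import os
import sys

def in_unit_test_mode() -> bool:
    # pytest sets PYTEST_CURRENT_TEST for the duration of each test,
    # so its presence is a cheap way to detect a pytest run.
    return "PYTEST_CURRENT_TEST" in os.environ

def pytest_print(txt: str) -> None:
    # Write to the interpreter's original stdout. This bypasses pytest's
    # sys-level capture; under the default fd-level capture the real
    # helper would need pytest's capture manager instead.
    stream = sys.__stdout__ or sys.stdout
    print(txt, file=stream)
```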
Speed-Tiered Test Classification
The "fast" suite becoming slow is a silent killer of developer feedback loops. We solve it by classifying tests into three tiers using pytest markers, enforced with timeouts:
| Tier | Marker | Timeout | When to run |
|---|---|---|---|
| Fast | (no marker) | 5 s | Every commit and PR |
| Slow | @pytest.mark.slow | 30 s | Before merging |
| Superslow | @pytest.mark.superslow | 3600 s | Scheduled CI |
Unmarked tests are fast by default. The team runs the fast suite before every PR
with a single command. Slow and superslow tests run in CI on a schedule or
before release. Timeouts are enforced by pytest-timeout, so a test that
accidentally grows past its tier's limit fails loudly instead of silently
slowing everyone down.
To reduce flakiness from transient timing issues, pytest-rerunfailures
automatically re-runs timed-out fast tests twice and slow or superslow tests
once before marking them as failed.
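In practice, a tiered suite is just markers plus a command-line filter. A sketch (marker registration and pytest-timeout's timeout settings would live in pytest.ini and are not shown here):

```python
import time

import pytest

# Unmarked test: fast tier, expected to finish well inside the 5 s budget.
def test_parse_config() -> None:
    assert {"a": 1}["a"] == 1

# Slow tier: allowed up to 30 s, run before merging.
@pytest.mark.slow
def test_full_pipeline() -> None:
    time.sleep(0.1)  # stand-in for real pipeline work
    assert True
```

Running only the fast tier is then a marker filter: pytest -m "not slow and not superslow".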
How It Fits Into a CI Pipeline
Because golden files are committed to the repository, CI does not need any special configuration to use them. The runner checks out the branch (golden files come with it) and runs the test suite exactly the same way a developer does locally. A mismatch fails the build with a diff; a match passes.
This means adding a new test with golden file assertions requires zero CI
configuration changes. Write the test, run pytest --update_outcomes once to
generate the golden file, commit both, and the pipeline handles the rest on
every subsequent run.
For long-running tests that build intermediate artefacts, pytest --incremental
preserves scratch directories between runs so reruns skip redundant setup.
Together, the three features mean a new team member can clone the repo, run the test suite, and get reliable results on the first try, without reading a setup guide first.
Use It Yourself
The framework is open-source under the Apache 2.0 license. You can clone it, use it as a Git submodule, or lift the pattern directly into your own codebase:
git clone https://github.com/causify-ai/helpers.git
The core of the framework is a single file:
helpers/hunit_test.py: the TestCase base class with golden file testing, reproducibility resets, and directory helpers
Full documentation:
- How to Write Unit Tests: naming conventions, assertions, mocking patterns
- How to Run Unit Tests: invoke commands, coverage, CI integration
- Unit Test Framework Architecture: design decisions and internals