Authors

Vedanshu Joshi, Software Engineer
GP Saggese, Chief Technology Officer
Shayan Ghasemnezhad, Infrastructure & DevOps Lead Engineer

Metadata

Thursday, May 07, 2026
Software Engineering, Testing
TL;DR

Three things that turn test maintenance from a growing chore into a non-event: golden files for expected output, automatic reproducibility across machines, and speed-enforced test tiers.


Test suites should get easier to maintain as a codebase grows, not harder. In practice, the opposite happens: inline expected strings become walls of text, flaky tests appear on machines other than yours, and the "fast" suite slowly fills up with tests that take 30 seconds each.

At Causify, we hit all three of these problems early. Our codebase is heavy with data transformations and pipeline outputs, so tests that check formatted DataFrames or multi-step reports are common. Managing them with standard pytest patterns did not scale.

We built a test base class to fix this, released it under the Apache 2.0 license in our helpers repository, and have been running it in production for years. This post covers the three things that make it different: golden file testing, reproducibility by default, and speed-tiered test classification.

Golden File Testing

The most common test maintenance burden in data-heavy codebases is updating expected strings. A standard pytest test for a function that produces a formatted report looks like this:

def test_report(self) -> None:
    result = build_report(data)
    expected = """
# shape=(100, 5)
# columns=open,high,low,close,volume
# index=[2024-01-01, 2024-03-31]
min_price=42.1, max_price=98.7
"""
    self.assertEqual(result, expected)

When the format changes (a new column, a different precision, a renamed field), the test breaks. You run pytest, read the diff in terminal output, copy the new value, paste it into the source file, repeat for every broken test.

For one-line outputs this is tolerable. For a 50-line formatted table or a deeply nested config, it is a time sink. And the PRs that result are hard to review: the meaningful output change is buried inside edited Python string literals that reviewers have to trust are correct.

The Fix: Expected Output Lives on Disk

Our solution is to stop putting expected values inside test code. Instead, the expected output for each test lives in its own text file on disk, called a "golden file". The test code itself stays minimal:

import helpers.hunit_test as hunitest

class TestBuildReport1(hunitest.TestCase):
    def test1(self) -> None:
        # Prepare inputs.
        data = load_test_data()
        # Run test.
        result = build_report(data)
        # Check outputs.
        self.check_string(result)

The first time you run this with --update_outcomes, the framework writes the actual output to a file next to the test file:

test/outcomes/TestBuildReport1.test1/output/test.txt

Every subsequent run without that flag compares actual output against the stored file and fails with a diff if they differ.

When output changes intentionally, you run one command:

pytest --update_outcomes

The framework overwrites all affected golden files and stages them with git add. Your PR now contains a clean .txt file change instead of edited Python strings. Reviewers can read it directly. A wrong update is obvious at a glance.
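Under the hood the mechanism is simple. A minimal sketch of the idea (not the framework's actual code; the path handling and flag plumbing are simplified):

import os

def check_string_sketch(test_name: str, actual: str, update_outcomes: bool) -> None:
    """Compare `actual` against the golden file, or overwrite it in update mode."""
    golden_path = os.path.join("test", "outcomes", test_name, "output", "test.txt")
    if update_outcomes:
        # --update_outcomes: write the actual output as the new golden file.
        os.makedirs(os.path.dirname(golden_path), exist_ok=True)
        with open(golden_path, "w") as f:
            f.write(actual)
        return
    # Normal run: compare against the stored golden file and fail on any diff.
    with open(golden_path) as f:
        expected = f.read()
    assert actual == expected, f"Output differs from golden file: {golden_path}"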

When a test produces multiple string outputs, pass tag= to give each its own golden file:

self.check_string(summary, tag="summary")
self.check_string(details, tag="details")

What the Golden File Looks Like

The file TestBuildReport1.test1/output/test.txt holds the expected output exactly as the function produced it:

# shape=(100, 5)
# columns=open,high,low,close,volume
# index=[2024-01-01, 2024-03-31]
min_price=42.1, max_price=98.7

That is all. No quotes, no escape sequences, no surrounding Python syntax. When a reviewer opens a PR that changes this file, they read the output change directly, not a modified string literal buried inside a test function.

Predictable Directory Layout

Every test class gets three directories derived from its class and method name.

Static fixtures checked into git:

test/outcomes/TestBuildReport1.test1/input/

Golden files checked into git:

test/outcomes/TestBuildReport1.test1/output/

Ephemeral files deleted after the test:

test/scratch/TestBuildReport1.test1/

The input/ directory holds static fixtures the test reads. The output/ directory is where check_string() writes and compares golden files. The scratch/ directory is for temporary artefacts produced during the test that you do not want to commit.

Because the paths are derived from the class and method name, you can always find a test's data without reading the test code.
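Inside a test, these locations are available through accessor methods on the base class. A sketch of typical usage, assuming accessors named get_input_dir() and get_scratch_space() (check hunit_test.py for the exact names), with load_test_data() and build_report() standing in for your own code:

import os

import helpers.hunit_test as hunitest

class TestBuildReport2(hunitest.TestCase):
    def test1(self) -> None:
        # Read a static fixture from test/outcomes/TestBuildReport2.test1/input/.
        input_path = os.path.join(self.get_input_dir(), "data.csv")
        data = load_test_data(input_path)
        result = build_report(data)
        # Write throwaway artefacts to test/scratch/TestBuildReport2.test1/.
        tmp_path = os.path.join(self.get_scratch_space(), "report_draft.txt")
        with open(tmp_path, "w") as f:
            f.write(result)
        # Compare the final output against the golden file in output/.
        self.check_string(result)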

When a test fails on mismatch, the framework prints a side-by-side diff and outputs a ready-to-run vimdiff command so you can inspect the full difference in one copy-paste. When the golden file is missing entirely, it tells you to run with --update_outcomes to generate it.

When Golden Tests Help Most

Golden file testing pays off most when the output you are testing is large, structured, or changes as a unit rather than field by field.

Some cases where we reach for it immediately:

  • Formatted DataFrames: a 50-row summary table with column widths, index labels, and float precision. An inline string is unreadable; a .txt file is just a table
  • Nested config objects: hierarchical configuration printed for debugging or logging. When a new key is added, the golden diff shows exactly where it landed
  • Pipeline stage reports: multi-step ETL or model-training output where each stage appends a section. One command updates the whole snapshot
  • API or CLI output: responses from external tools called in integration tests, where the full response matters but is too large to inline

For simple scalar assertions (assertEqual(result, 42)), inline expected values are still fine. Golden files shine when the expected output would take up more space than the test logic itself.

String Assertions with Automatic Diffing

There are three levels of string assertion in the framework:

  • assertEqual(result, 42): scalars and one-liners
  • assert_equal(result, expected): short strings you can write inline, with sdiff and a vimdiff script on failure
  • check_string(result): large outputs that belong in a golden file

assert_equal() uses the same diff mechanism as check_string() but does not touch the filesystem. Pass fuzzy_match=True to ignore whitespace differences, or purify_text=True to strip machine-specific paths before comparing.
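For example, fuzzy matching lets an inline expected string stay indented with the surrounding test code (format_summary() and data here are placeholders for your own code):

import helpers.hunit_test as hunitest

class TestSummary1(hunitest.TestCase):
    def test1(self) -> None:
        result = format_summary(data)
        expected = """
        min_price=42.1
        max_price=98.7
        """
        # fuzzy_match=True ignores the leading whitespace introduced by
        # indenting the expected string inside the test method.
        self.assert_equal(result, expected, fuzzy_match=True)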

Testing DataFrames with Numerical Tolerance

For pandas DataFrames the framework provides check_dataframe(), which serializes the DataFrame in CSV format as its golden file and compares numerically rather than as raw text:

import helpers.hunit_test as hunitest

class TestPrices1(hunitest.TestCase):
    def test1(self) -> None:
        result = compute_prices(data)
        self.check_dataframe(result)

The golden file follows the same path pattern as check_string() and ends in .txt (the default tag is test_df, so the file is test_df.txt). Run pytest --update_outcomes once to generate it, then commit it alongside the test — the same workflow as string golden files. On a mismatch the framework prints the actual and expected DataFrames, a masked view showing only the differing cells, and the relative error for each cell.

The err_threshold parameter (default 0.05) sets the relative tolerance passed to numpy.allclose:

self.check_dataframe(result, err_threshold=0.01)   # tight: 1% tolerance
self.check_dataframe(result, err_threshold=0.10)   # loose: 10% tolerance

Use a tight threshold for exact financial figures and a looser one where floating-point rounding across environments is expected. When multiple DataFrames appear in the same test, pass a tag argument to give each its own golden file:

self.check_dataframe(prices_df, tag="prices")
self.check_dataframe(returns_df, tag="returns")

When the expected DataFrame is small enough to construct inline, assert_dfs_close() skips the golden file and compares directly:

class TestPrices1(hunitest.TestCase):
    def test1(self) -> None:
        result = compute_prices(data)
        expected = pd.DataFrame({"price": [42.1, 98.7]})
        self.assert_dfs_close(result, expected)

It checks index, columns, and values using numpy.allclose and accepts the same keyword arguments as check_dataframe().
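Conceptually the check boils down to something like this (a simplified sketch, not the framework's actual code):

import numpy as np
import pandas as pd

def dfs_close_sketch(actual: pd.DataFrame, expected: pd.DataFrame, rtol: float = 0.05) -> bool:
    """Return True when two DataFrames match in shape and are numerically close."""
    same_index = actual.index.equals(expected.index)
    same_columns = list(actual.columns) == list(expected.columns)
    # Values are compared with a relative tolerance, mirroring err_threshold.
    close_values = np.allclose(actual.values, expected.values, rtol=rtol)
    return same_index and same_columns and close_values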

Reproducibility Without the Boilerplate

Golden files only stay stable if tests produce the same output on every machine. Floating-point formatting, pandas display options, and random state all vary across environments and library versions unless you explicitly control them.

Our base class handles this automatically. Before every test method runs, it resets the random seed, restores pandas display options to known defaults, and replaces matplotlib.pyplot.show with a function that does nothing. None of this requires any code in your test class. Inherit from hunitest.TestCase and reproducibility comes for free.
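Conceptually, the per-test reset amounts to something like the following (an illustrative sketch; the seed and display values are arbitrary, and the real resets live in the base class):

import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def reset_test_environment(seed: int = 42) -> None:
    """Pin the usual sources of cross-machine variation before a test runs."""
    # Fixed random state so any sampled data is identical on every run.
    random.seed(seed)
    np.random.seed(seed)
    # Known pandas display defaults so formatted DataFrames render identically.
    pd.reset_option("display.max_rows")
    pd.reset_option("display.max_columns")
    pd.set_option("display.width", 1000)
    # No-op plotting so tests never block on a GUI window.
    plt.show = lambda *args, **kwargs: None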

A test that passes on a developer's laptop passes the same way in CI and six months later when a new pandas version ships with different default display widths.

If your test class needs custom setup, override set_up_test() and tear_down_test() rather than setUp() / tearDown(). The base class runs these through a pytest fixture that guarantees teardown executes even when a test fails, which the standard setUp/tearDown pair does not guarantee in all pytest versions.
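For example, a test that needs a temporary working directory can hook into those methods, and the base class takes care of calling them around each test:

import tempfile

import helpers.hunit_test as hunitest

class TestWithScratchFile1(hunitest.TestCase):
    def set_up_test(self) -> None:
        # Runs before each test method, after the base class's own resets.
        self.tmp_dir = tempfile.TemporaryDirectory()

    def tear_down_test(self) -> None:
        # Guaranteed to run even when the test body raises.
        self.tmp_dir.cleanup()

    def test1(self) -> None:
        path = f"{self.tmp_dir.name}/report.txt"
        with open(path, "w") as f:
            f.write("min_price=42.1, max_price=98.7")
        with open(path) as f:
            self.check_string(f.read())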

Handling Non-Deterministic Output

Some outputs contain values that legitimately differ between machines or runs: absolute file paths, usernames in log lines, memory addresses in repr() output. These would cause every golden file to mismatch on a different developer's machine.

The framework handles this with a purify_text flag on check_string:

self.check_string(result, purify_text=True)

When purify_text=True, a TextPurifier runs over the output before comparison and strips known sources of machine-specific noise: absolute paths, usernames, and similar patterns. The golden file stores the cleaned version, so it matches everywhere.
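The exact substitution rules live in the framework's TextPurifier; the effect is along these lines (the patterns and placeholders below are illustrative, not the real list):

import re

def purify_sketch(txt: str) -> str:
    """Replace machine-specific fragments with stable placeholders."""
    # Collapse absolute paths under a home directory to a neutral root.
    txt = re.sub(r"/(?:Users|home)/\w+/src/helpers", "$GIT_ROOT", txt)
    # Hide the current username wherever it is logged.
    txt = re.sub(r"user=\w+", "user=$USER_NAME", txt)
    return txt

# "/Users/alice/src/helpers/report.txt written by user=alice"
# purifies to "$GIT_ROOT/report.txt written by user=$USER_NAME".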

Test-Mode Utilities

The base class exposes three small helpers that solve common annoyances when writing tests. All three are importable from helpers.hunit_test:

import helpers.hunit_test as hunitest

in_unit_test_mode() returns True when the code is executing inside a pytest run. Use it to gate expensive setup or behavior that only makes sense during tests:

if hunitest.in_unit_test_mode():
    # skip network call during unit tests
    return cached_fixture

pytest_print(txt) prints text that bypasses pytest's output capture. Standard print() calls are swallowed unless you pass -s; this function always writes directly to stdout regardless of capture mode:

hunitest.pytest_print("debug snapshot: " + str(intermediate_result))

pytest_warning(txt) does the same but prepends a yellow-colored WARNING: prefix. The framework uses it internally when a golden file is created or updated, so you always see an explicit notice in the test run output:

hunitest.pytest_warning("using fallback data source")

Speed-Tiered Test Classification

The "fast" suite becoming slow is a silent killer of developer feedback loops. We solve it by classifying tests into three tiers using pytest markers, enforced with timeouts:

Tier       Marker                    Timeout   When to run
Fast       (no marker)               5 s       Every commit and PR
Slow       @pytest.mark.slow         30 s      Before merging
Superslow  @pytest.mark.superslow    3600 s    Scheduled CI
Unmarked tests are fast by default. The team runs the fast suite before every PR with a single command. Slow and superslow tests run in CI on a schedule or before release. Timeouts are enforced by pytest-timeout, so a test that accidentally grows past its tier's limit fails loudly instead of silently slowing everyone down.

To reduce flakiness from transient timing issues, pytest-rerunfailures automatically re-runs timed-out fast tests twice and slow or superslow tests once before marking them as failed.
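Moving a test into a slower tier is a one-line change (run_backtest() and load_test_data() are placeholders for your own code):

import pytest

import helpers.hunit_test as hunitest

class TestFullBacktest1(hunitest.TestCase):
    @pytest.mark.slow
    def test1(self) -> None:
        # Loads a larger fixture, so it belongs in the 30-second tier.
        result = run_backtest(load_test_data())
        self.check_string(result)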

How It Fits Into a CI Pipeline

Because golden files are committed to the repository, CI does not need any special configuration to use them. The runner checks out the branch (golden files come with it) and runs the test suite exactly the same way a developer does locally. A mismatch fails the build with a diff; a match passes.

This means adding a new test with golden file assertions requires zero CI configuration changes. Write the test, run pytest --update_outcomes once to generate the golden file, commit both, and the pipeline handles the rest on every subsequent run.

For long-running tests that build intermediate artefacts, pytest --incremental preserves scratch directories between runs so reruns skip redundant setup.

Together, the three features mean a new team member can clone the repo, run the test suite, and get reliable results on the first try, without reading a setup guide first.

Use It Yourself#

The framework is open-source under the Apache 2.0 license. You can clone it, use it as a Git submodule, or lift the pattern directly into your own codebase:

git clone https://github.com/causify-ai/helpers.git

The core of the framework is a single file:

  • helpers/hunit_test.py: the TestCase base class with golden file testing, reproducibility resets, and directory helpers

Full documentation:
