
Reference

Every flag, marker, fixture, CLI command, and public function. The CLI and Python API below are rendered live from the source; the pytest surface (flags, marker, fixture, blob schema) is curated here. For the narrative versions see Quickstart, Choosing a metric, Grouping by dims, and Compare & gate CI.

pytest command-line flags

The plugin adds these to any pytest run (alongside pytest-benchmark's own flags). This table is generated from the plugin's own --help text, so it can't drift from the code:

Flag Default What
--benchmark-memory off Record peak memory (a memray pass) for every benchmark() call, not just the benchmark_memory fixture — no test changes. Off by default; the fixture is always measured, with or without this flag.
--benchmark-memory-repeats=N Force a fixed number of memray passes per benchmark, suite-wide; the reported peak is the min across them. Overridden per-test by @pytest.mark.benchmem(repeats=N). Default: adaptive — run passes until the min floor settles (≥2, cap 10). Set this for a fixed, reproducible count (e.g. CI gating against a saved baseline).
--benchmark-memory-warmup=N 1 Untracked dry-runs of the action before measuring, suite-wide, to shed one-time costs (lazy imports, first-touch caches) so the measured passes aren't inflated by cold start. Overridden per-test by @pytest.mark.benchmem(warmup=N). Default: 1; set 0 to disable.
--benchmark-memory-max-time=SECONDS Wall-clock budget for the adaptive memory passes (the analogue of --benchmark-max-time): caps how long adaptive sampling spends per benchmark. Ignored when --benchmark-memory-repeats forces a fixed count. Default: no time bound — the pass cap alone bounds it.
--benchmark-memory-compare=REF off Compare this run's peak memory against a prior saved run (a pytest-benchmark storage ref like 0001, or the latest if no value is given); folds base and delta-peak columns into the table.
--benchmark-memory-compare-fail=FIELD:THRESHOLD Fail the session on a memory regression, e.g. peak:10%, peak:5MiB, allocations:5% (repeatable). Fields: peak, allocated, allocations, rss (rss needs isolated runs). Implies --benchmark-memory-compare.
--benchmark-memory-profile=DIR Save the memray profile (.bin) into DIR — render later with memray flamegraph (or tree/summary). Scope follows the gate: WITH --benchmark-memory-compare-fail only the regressing ids, otherwise EVERY measured benchmark. Off by default (disk cost).
--benchmark-memory-profile-native off Capture native (C/C++/Rust) stacks in the kept profile, so the flamegraph attributes memory inside extension code (polars/numpy/solver bindings) instead of one opaque ??? at ??? bucket. Only affects --benchmark-memory-profile runs; opt-in (slower, bigger .bin). Per-test override: @pytest.mark.benchmem(profile_native=True). Off by default.
--benchmark-memory-table combined Layout for the memory metrics: combined (default) folds them into pytest-benchmark's timing table; split prints a separate memory table.
--benchmark-memory-columns=peak,allocated,allocations,rss Which memory metrics the table shows, comma-separated and in order: peak, allocated, allocations, rss (rss only shows for isolated runs). Default: peak only.
--benchmark-memory-stats=min,mean,max With repeats > 1, the stats each shown metric spreads into: min, mean, max, median, stddev. A single pass stays one column. Default: min,mean,max.

Timing regressions still use pytest-benchmark's own --benchmark-compare / --benchmark-compare-fail; the --benchmark-memory-compare* flags are the memory mirror. Their baseline comes from pytest-benchmark's storage (.benchmarks/) — save one first with --benchmark-save=NAME or --benchmark-autosave, or the gate finds nothing and passes. See Gate CI on a regression.
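As a rough illustration of the FIELD:THRESHOLD syntax, the sketch below parses a threshold and decides whether a metric regressed. It assumes a percent threshold is taken relative to the baseline value and a size threshold is compared against the raw growth; `parse_threshold` and `regressed` are hypothetical helpers written for this example, not part of the plugin.

```python
import re

# Binary size units, as in thresholds like "peak:5MiB".
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def parse_threshold(spec: str) -> tuple[str, str, float]:
    """Split 'peak:10%' into (field, 'percent', 10.0), or
    'peak:5MiB' into (field, 'absolute', bytes)."""
    field, _, raw = spec.partition(":")
    if raw.endswith("%"):
        return field, "percent", float(raw[:-1])
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(B|KiB|MiB|GiB)?", raw)
    if match is None:
        raise ValueError(f"bad threshold: {spec!r}")
    number, unit = match.groups()
    return field, "absolute", float(number) * UNITS[unit or "B"]


def regressed(spec: str, baseline: float, current: float) -> bool:
    """True when growth from baseline to current exceeds the threshold."""
    _, kind, limit = parse_threshold(spec)
    delta = current - baseline
    if kind == "percent":
        # Assumption: a zero baseline cannot regress by a percentage.
        return baseline > 0 and delta / baseline * 100 > limit
    return delta > limit
```

With no saved baseline there is nothing to pass as `baseline`, which mirrors why the gate finds nothing and passes when no run was saved first.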

The benchmem marker

@pytest.mark.benchmem(repeats=3)
def test_build(benchmark_memory):
    ...
Kwarg Default What
repeats auto force a fixed N memray passes for this test (default: adaptive — see below). Every pass is kept (the blob stores the whole series); the headline peak is the minimum across them, and --stat reports any other. Overrides the suite-wide --benchmark-memory-repeats.
warmup 1 untracked dry-runs of the action before measuring, to shed one-time costs (lazy imports, first-touch caches). 0 disables. Overrides the suite-wide --benchmark-memory-warmup.
isolate False run each memray pass in a fresh process and also record whole-process resident memory as the rss metric — the physical/OOM-relevant peak memray's logical heap can't give. Per-test only (no suite-wide flag): rss is a whole-job capacity number, meaningful only for build+operate benchmarks, so you mark the specific ones. Needs a top-level, picklable benchmarked function (see the whole-job warning below).
profile_native False on the --benchmark-memory-profile path, capture native (C/C++/Rust) stacks in the kept .bin, so a flamegraph attributes extension-code memory (polars/numpy/solver bindings) instead of one opaque ??? at ??? bucket. Opt-in (slower, bigger .bin). Overrides the suite-wide --benchmark-memory-profile-native.
max_peak fail the test if the headline peak exceeds this absolute ceiling. A size string ("100MiB", units B/KiB/MiB/GiB) or a bare int (bytes).
max_allocated as max_peak, on allocated (total bytes).
max_allocations as above, on the allocations count — a bare number (no unit).

Isolated rss measures the whole job — build the state inside the callable

The rss metric (isolate=True) runs the action in a fresh, empty process. Two consequences:

The build must happen inside the measured callable, and the callable must be a top-level, picklable function. The child starts with nothing, so it must construct whatever it operates on; and spawn serializes the call with standard pickle, so a lambda or closure is rejected (we don't use cloudpickle) — pass a module-level function plus lightweight args.

# ✅ ships only the spec (~bytes); the child builds + writes cold = the whole job's RSS
benchmark_memory(build_and_write, spec, n)

# ❌ a lambda/closure — rejected; std pickle can't serialize it (even build-inside)
benchmark_memory(lambda: write(build(spec, n)))

# ❌ a top-level partial over a *pre-built* model pickles fine, but ships the model and
#    measures *deserializing* it, not building it — the build never re-runs in the child
model = build(spec, n)
benchmark_memory(partial(write, model))

You can't isolate a single sub-phase. Since the child must build before it can operate, isolated rss is a build-plus-operate capacity number by construction, never a per-phase figure (e.g. write-only). For per-phase memory, use the in-process peak metric, which can measure a write given an already-built model. So the rule is two-part: use a top-level function (no lambdas), and don't pass heavy pre-built state — build it inside.
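The two-part rule can be checked with standard pickle alone. The sketch below uses stdlib callables (`list`, `len`) as stand-ins for a module-level build function, and `is_picklable` is a hypothetical helper for this example; it only shows what pickle accepts and how large the shipped payload is, not how the plugin spawns its child.

```python
import pickle
from functools import partial


def is_picklable(obj) -> bool:
    """True if standard pickle can serialize obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# Stand-in for "function plus lightweight spec": only the range spec is
# shipped, and the child would build the list itself.
spec_call = partial(list, range(100_000))

# Stand-in for "partial over pre-built state": it pickles fine, but the
# whole 100,000-element list travels to the child.
prebuilt_call = partial(len, list(range(100_000)))

# A lambda cannot be looked up by name, so standard pickle rejects it.
lambda_call = lambda: len(list(range(100_000)))

spec_size = len(pickle.dumps(spec_call))
prebuilt_size = len(pickle.dumps(prebuilt_call))
```

Comparing `spec_size` with `prebuilt_size` makes the cost of shipping pre-built state visible before any benchmark runs.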

Absolute ceilings — max_peak / max_allocated / max_allocations

@pytest.mark.benchmem(max_peak="100MiB", max_allocations=5000)
def test_build(benchmark_memory):
    benchmark_memory(build_model, 1000)

A baseline-free guardrail: the test fails if the measured metric exceeds the ceiling (test_build: peak 117 MiB exceeds max_peak 100 MiB). Thresholds are absolute only — there's no saved run to take a percent of; for relative gating against a prior run use --benchmark-memory-compare-fail or benchmem compare --fail-on. A ceiling is a worst-case budget, so with repeats > 1 (including adaptive sampling) the gate reads the worst pass — not the headline min — and fails if any pass breaches it; the two coincide for a single pass. The ceiling is enforced wherever memory is measured — the benchmark_memory fixture and the --benchmark-memory patch — but a plain benchmark() call without --benchmark-memory measures no memory, so the marker is a no-op there.
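A minimal sketch of the ceiling check described above, assuming the per-pass peaks are available as a list of byte counts. `to_bytes` and `breaches` are hypothetical names for this example; the point is that the gate compares the worst pass, not the headline min.

```python
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def to_bytes(ceiling) -> int:
    """Accept a bare int (bytes) or a size string such as '100MiB'."""
    if isinstance(ceiling, int):
        return ceiling
    text = ceiling.strip()
    # Try the longest suffixes first so 'MiB' is not mistaken for 'B'.
    for unit in sorted(UNITS, key=len, reverse=True):
        if text.endswith(unit):
            return int(float(text[: -len(unit)]) * UNITS[unit])
    raise ValueError(f"unrecognised size: {ceiling!r}")


def breaches(per_pass_peaks: list[int], ceiling) -> bool:
    """A ceiling is a worst-case budget: fail if any pass exceeds it."""
    return max(per_pass_peaks) > to_bytes(ceiling)


MiB = 1024**2
peaks = [95 * MiB, 117 * MiB]  # headline min is 95 MiB, worst pass is 117 MiB
```

Here `min(peaks)` sits under a 100 MiB ceiling while `breaches(peaks, "100MiB")` is still true, which is the case where the headline and the gate disagree.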

Scope: the benchmarked action only

This gates the benchmarked action only (the isolated call pytest-benchmem measures), not the whole test. For a whole-test limit or leak check, that's pytest-memray's limit_memory / limit_leaks — see the README's "With pytest-memray".

How many passes? By default pytest-benchmem samples adaptively — after an untracked warmup run, it runs the memray pass until the min floor settles (≥2 passes; capped at 10, or a --benchmark-memory-max-time budget). Deterministic code settles in ~3 passes; noisy code runs more. Set repeats=N (marker) or --benchmark-memory-repeats=N (suite) to force a fixed, reproducible count — what CI gating against a saved baseline wants. Full rationale and the noisy-workload guidance are in the guide: Repeats & adaptive sampling.

The benchmark_memory fixture

Depends on pytest-benchmark's benchmark fixture; measures peak in a separate untimed pass, then times via pytest-benchmark.

Order — memory first (cold), then timing

Every call form runs the memray pass first, then pytest-benchmark's timing (calibration + all rounds). This matters: timing runs the function thousands of times, which grows and fragments the allocator's arenas — so measuring memory after timing would report the warm plateau, not the fresh-process floor the headline min is meant to be. Memory-first measures the cold cost (the warmup pass still sheds the one-time cold-start within it); timing then runs cleanly, with no memray hooks active. This holds for __call__, pedantic, and the --benchmark-memory patch alike. The standalone measure_peak / measure_memory have no timing phase at all; warmup=0 skips the warmup, repeats=N forces a fixed count.

Measures memory, then times function(*args, **kwargs):

benchmark_memory(sorted, data)

Explicit control, like pytest-benchmark's pedantic plus a memory pass:

benchmark_memory.pedantic(target, args=(), kwargs=None, setup=None,
                          rounds=1, warmup_rounds=0, iterations=1)
  • setup — a callable run untracked before each measured call; if it returns (args, kwargs), those supply the call's arguments. Used for both the timed rounds and each (adaptive) memory sample — one setup rebuilds fresh state for both — so a stateful action's memory samples stay independent. The same applies to benchmark.pedantic(setup=…) under --benchmark-memory: no extra changes.
  • rounds, warmup_rounds, iterations — as in pytest-benchmark.
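To see why a per-call setup keeps samples independent, the sketch below imitates the calling convention only: setup runs before each call and, if it returns (args, kwargs), supplies the arguments. `run_passes` and `grow` are hypothetical stand-ins, and each call's return value stands in for a memory sample.

```python
def run_passes(target, setup=None, passes=3):
    """Call target `passes` times, running setup (unmeasured) before each call."""
    results = []
    for _ in range(passes):
        args, kwargs = (), {}
        if setup is not None:
            supplied = setup()
            if supplied is not None:
                args, kwargs = supplied  # setup provides the call's arguments
        results.append(target(*args, **kwargs))
    return results


def grow(cache):
    """A stateful action: it mutates the object it is handed."""
    cache.append(0)
    return len(cache)


shared = []
# No setup: every pass sees the state the previous pass left behind.
carried_over = run_passes(lambda: grow(shared))
# With setup: each pass gets a fresh list, so the samples are independent.
independent = run_passes(grow, setup=lambda: (([],), {}))
```

The carried-over series grows pass by pass, while the series with setup stays flat.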

Mostly memory, little timing? There's no memory-only switch — the entry rides pytest-benchmark's timing. To trim it: --benchmark-min-rounds=1 --benchmark-max-time=0 (no test changes), or pedantic(rounds=1, warmup_rounds=0) for a single call. For pure memory outside pytest, use measure_peak / measure_memory.

Attributes (available after a call):

Attribute What
extra_info pytest-benchmark's per-benchmark dict. Set scalars here to attach analysis dims; the memory blob lands here under the benchmem key.
peak_bytes peak memory (bytes) from the last call, or None before any call.
result the full MemoryResult from the last call, or None.

The extra_info.benchmem blob

Each measured benchmark stores this dict under extra_info["benchmem"] — three flat per-repeat series, one entry per memray pass. Every reported number (headline peak = min, any --stat) derives from these on read:

Key What
peak_bytes per-repeat high-water of live bytes — the peak metric (headline = min)
allocations per-repeat allocation count — the allocations metric
total_bytes per-repeat total bytes allocated — the allocated metric (churn peak hides)
rss_bytes per-repeat whole-process resident high-water (ru_maxrss) — the rss metric. Only present under isolate=True (each pass in a fresh process); absent otherwise.
{"peak_bytes": [800000, 805000], "allocations": [12, 12], "total_bytes": [800000, 805000]}

See Choosing a metric for when to reach for each, and --stat for distributions.
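As a sketch of how the reported numbers derive from the stored series, the function below computes the headline values from the example blob: the peak is the minimum across passes, and the allocation figures are taken from that same min-peak pass. `headline` is a hypothetical helper for this example, not the plugin's reader.

```python
from statistics import mean

blob = {
    "peak_bytes": [800000, 805000],
    "allocations": [12, 12],
    "total_bytes": [800000, 805000],
}


def headline(blob: dict) -> dict:
    """Derive the headline numbers from the per-repeat series."""
    peaks = blob["peak_bytes"]
    # Index of the min-peak pass; the other headline values come from it.
    i = min(range(len(peaks)), key=peaks.__getitem__)
    return {
        "peak": peaks[i],
        "allocations": blob["allocations"][i],
        "allocated": blob["total_bytes"][i],
        "peak_max": max(peaks),   # worst pass, for the spread
        "repeats": len(peaks),    # not stored; it is the series length
    }


summary = headline(blob)
mean_peak = mean(blob["peak_bytes"])  # a distribution stat over the same series
```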

CLI — benchmem

Installed with pytest-benchmem[plot]. The full command tree and every option, captured live from the typer app as it actually renders in a terminal:

benchmem --help
Usage: benchmem [OPTIONS] COMMAND [ARGS]...

pytest-benchmem — plot and compare benchmark runs.

╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
--install-completion Install completion for the current shell.
--show-completion Show completion for the current shell, to copy it or customize the
installation.
--help Show this message and exit.
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ───────────────────────────────────────────────────────────────────────────────────────╮
plot Render an interactive plotly view from one or more pytest-benchmark runs.
compare Print a per-id table for one run, or compare two or more (and optionally gate CI).
sweep Run a benchmark suite across several installed versions of a package.
flamegraph Render a kept memory profile in one step — resolve the ``.bin`` for a test and run
memray.
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
benchmem plot --help
Usage: benchmem plot [OPTIONS] RUNS...

Render an interactive plotly view from one or more pytest-benchmark runs.

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
* runs... PATH pytest-benchmark JSON file(s). [required]
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
--columns [time|peak|allocated|allocations|rss] Metric to plot: time | peak | allocated | allocations | rss (rss = isolated runs only). One per figure (a plot has a single value axis); same flag as `compare`, and the spread shows as whiskers via --band. [default: time]
--view TEXT compare | scatter | sweep |
scaling (default: by count).
--facet TEXT Dim to facet by.
--x TEXT scaling: dim for the x-axis.
--clip FLOAT Clamp the colour scale.
--where TEXT Filter rows by dim: KEY=VALUE
(repeatable, AND-combined).
--free-axes [x|y|both] Free facet axes: x | y | both
(needs --facet).
--band [auto|minmax|none] scaling: spread whiskers on
memory metrics — auto | minmax |
none.
[default: auto]
--label -l TEXT Series label per run, in order
(repeat). Default: stem.
--output -o PATH HTML out.
--open --no-open [default: no-open]
--help Show this message and exit.
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
benchmem compare --help
Usage: benchmem compare [OPTIONS] RUNS...

Print a per-id table for one run, or compare two or more (and optionally gate CI).

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
* runs... PATH One or more pytest-benchmark runs, oldest → newest. One prints a plain
table; two or more compare (a sweep is N).
[required]
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
--columns TEXT Comma list of metrics: time | peak | allocated | allocations | rss (rss
= isolated runs only; e.g. peak or time,peak,rss). Default: time,peak.
Each is shown across every --stat; a metric absent from every run is
dropped.
--group-by TEXT Group rows into sub-tables: fullname | name | func | group | module |
class | param:NAME (comma-composable).
[default: fullname]
--stat TEXT Which stat column(s) per metric: min | max | mean | median | stddev, or
all (the default) for the full spread side by side.
--sort TEXT Row order: name (id) | value (largest in the last run) | change.
[default: name]
--csv PATH Also write the raw (unscaled) comparison to this CSV file.
--fail-on TEXT Exit non-zero on a regression of the first run vs the last.
FIELD:THRESHOLD, repeatable — e.g. --fail-on peak:10% --fail-on
peak:5MiB --fail-on rss:10% (rss gates only isolated runs).
--help Show this message and exit.
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
benchmem sweep --help
Usage: benchmem sweep [OPTIONS] PACKAGE VERSIONS...

Run a benchmark suite across several installed versions of a package.

Provisions one fresh uv venv per version, runs 'pytest <suite> --benchmark-only'
in each writing <out>/<version>.json, then prints the next step. --memory adds
the memory pass; forward any other pytest flag with --pytest-arg, e.g.
benchmem sweep mypkg 1.2.0 1.3.0 --suite benchmarks/ --memory --pytest-arg=-k.

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
* package TEXT Package under test; each plain version installs `<package>==<v>`.
[required]
* versions... TEXT Versions or pip specs to sweep, e.g. 1.2.0 1.3.0
git+https://github.com/me/pkg@main.
[required]
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
* --suite PATH Benchmark suite (dir or file) to run in each version's
venv.
[required]
--out PATH Directory for the per-version JSON runs.
[default: .benchmarks/sweep]
--memory --no-memory Add --benchmark-memory to each pytest run.
[default: no-memory]
--pytest-arg TEXT Arg forwarded to pytest, one token each, repeatable
(e.g. --pytest-arg=-k).
--pin TEXT Extra pip spec installed alongside (repeatable).
--as-of TEXT YYYY-MM-DD for uv --exclude-newer (reproducible
resolve).
--import-check TEXT Module asserted to resolve to the venv (isolation
preflight).
--copy-dir PATH Directory copied into each venv's cwd (the suite
imports from here).
--help Show this message and exit.
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
benchmem flamegraph --help
Usage: benchmem flamegraph [OPTIONS] PROFILE_DIR [TEST_ID]

Render a kept memory profile in one step — resolve the ``.bin`` for a test and run memray.

Closes the "regressed → *where*?" loop after ``--benchmark-memory-profile``: instead of
finding the right ``.bin`` and remembering the memray subcommand, point at the profile dir
and name the test (or ``--worst peak`` to auto-pick the heaviest). Defaults to an HTML
flamegraph written next to the ``.bin``; ``--report tree|summary|stats`` prints to the
terminal instead.

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
* profile_dir PATH Directory of kept .bin profiles (--benchmark-memory-profile).
[required]
[test_id] TEXT Test id (exact, or a unique substring) to render; omit with --worst.
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
--worst TEXT Auto-pick the heaviest: peak | allocated | allocations
--report TEXT memray reporter: flamegraph | table | tree | summary | stats
[default: flamegraph]
--native --no-native Require the profile to carry native traces (captured via
--benchmark-memory-profile-native); error if it doesn't.
[default: no-native]
--output -o PATH HTML out path (default: next to the .bin).
--open --no-open Open the rendered HTML. [default: no-open]
--force -f Overwrite an existing render.
--help Show this message and exit.
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

Public Python API

Light to import — pytest_benchmem re-exports only the engine and the readers; pytest_benchmem.plotting pulls plotly and pytest_benchmem.sweep shells to uv, so import those submodules directly.

Engine

measure_peak

measure_peak(
    action: Action, repeats: int | None = None
) -> int

Run action() under memray.Tracker and return peak bytes.

The bare one-liner for a REPL or notebook; measure_memory returns the full result (allocation count, spread). repeats behaves as it does there: None (default) samples adaptively, an int forces a fixed pass count.

Parameters:

Name Type Description Default
action Action

The zero-argument callable to measure.

required
repeats int | None

Fixed pass count, or None to sample adaptively.

None

Returns:

Type Description
int

Peak bytes (the headline peak = min across passes, after warmup).

measure_memory

measure_memory(
    action: Action,
    repeats: int | None = None,
    *,
    warmup: int = _DEFAULT_WARMUP,
    isolate: bool = False,
    max_time: float | None = None,
    min_passes: int = _ADAPTIVE_MIN_PASSES,
    max_passes: int = _ADAPTIVE_MAX_PASSES,
    patience: int = _ADAPTIVE_PATIENCE,
    keep_bin: Path | None = None,
    native: bool = False,
    setup: Action | None = None,
) -> MemoryResult

Run action() under memray.Tracker → MemoryResult, one pass per repeat.

The warmup untracked dry-runs go first to shed one-time costs; then each measured pass gets a fresh tracker. The headline is the min across passes (see MemoryResult); every pass's Measurement is kept for spread stats.

With isolate=True each measured pass runs in a fresh spawned process (each warming itself), and that child's whole-process resident high-water (ru_maxrss) is recorded as Measurement.rss_bytes, a physical-memory reading attributable to the action, which an in-process pass can't give. The action (and setup) must be picklable (a top-level callable, not a lambda/closure); keep_bin is ignored in this mode.

Two modes, by repeats:

  • repeats=N (an int) — run exactly N passes. Fixed and reproducible; what CI gating and saved-baseline comparisons want.
  • repeats=None (default) — sample adaptively: keep running passes until the min stops moving (no new low for patience passes), bounded by min_passes (≥2), max_passes, and an optional max_time budget. Deterministic code settles in a few passes; noisy code runs more.
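The adaptive mode's stopping rule can be sketched in a few lines. This is an illustration under stated assumptions, not the engine's code: `sample_adaptively` and `measure_once` are hypothetical names, the default `patience=2` is a guess (the real default is the `_ADAPTIVE_PATIENCE` constant, whose value is not shown here), and the optional max_time budget is left out.

```python
from typing import Callable


def sample_adaptively(
    measure_once: Callable[[], int],
    *,
    min_passes: int = 2,
    max_passes: int = 10,
    patience: int = 2,  # assumed value, for illustration only
) -> list[int]:
    """Run passes until the running minimum stops moving."""
    peaks: list[int] = []
    best = None
    stale = 0  # consecutive passes without a new minimum
    while len(peaks) < max_passes:
        peak = measure_once()
        peaks.append(peak)
        if best is None or peak < best:
            best, stale = peak, 0
        else:
            stale += 1
        if len(peaks) >= min_passes and stale >= patience:
            break
    return peaks


# A deterministic workload: the first pass sets the floor and the next
# two confirm it.
steady = sample_adaptively(lambda: 100)

# A workload that keeps finding new lows runs into the pass cap.
falling = iter(range(20, 0, -1))
noisy = sample_adaptively(lambda: next(falling))
```

Under these assumed defaults the deterministic case stops after three passes and the ever-falling case stops at the cap of ten.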

Parameters:

Name Type Description Default
action Action

The zero-argument callable to measure.

required
repeats int | None

Fixed pass count, or None to sample adaptively.

None
warmup int

Untracked dry-runs (setup + action) before measuring; 0 disables.

_DEFAULT_WARMUP
isolate bool

Run each pass in a fresh spawned process and record its ru_maxrss as Measurement.rss_bytes. Requires a picklable action/setup.

False
max_time float | None

Wall-clock budget (seconds) for adaptive sampling; None = no time bound.

None
min_passes int

Minimum passes when sampling adaptively.

_ADAPTIVE_MIN_PASSES
max_passes int

Hard ceiling on passes when sampling adaptively.

_ADAPTIVE_MAX_PASSES
patience int

Stop adaptive sampling after this many consecutive passes with no new min.

_ADAPTIVE_PATIENCE
keep_bin Path | None

If set, the first pass's profile .bin is retained here (for a later render_flamegraph); the rest still go to temp dirs and are discarded.

None
native bool

Capture native (C/C++/Rust) stacks in the kept .bin so a flamegraph can attribute memory inside extension code instead of an opaque native bucket. Costs runtime and disk; only meaningful with keep_bin (ignored otherwise).

False
setup Action | None

Optional zero-arg callable run untracked before each pass (and each warmup run) — its allocations are not measured. Use it to rebuild fresh state so a stateful action (one that caches on or mutates a carried-over object) gives independent samples instead of a decaying/accumulating series. Mirrors pytest-benchmark's pedantic(setup=...).

None

Returns:

Type Description
MemoryResult

A MemoryResult over every measured pass (warmup runs are not retained).

MemoryResult dataclass

A memory measurement across repeats passes, derived from the per-repeat samples.

The per-repeat samples are the single source of truth: that's all the blob stores (the series); everything else is derived from them on read.

The headline peak_bytes is the minimum peak across passes: the fresh-process floor, unbiased by the in-process warm plateau (repeated runs fragment and grow arenas and allocate more) that a central stat would report. allocations and total_bytes come from that same min-peak run (a coherent snapshot); peak_bytes_max is the worst peak, so the spread is visible. A warm-plateau or steady-state read is available via the mean / median --stat. A single pass collapses all of these to its own values.

repeats property

repeats: int

How many passes were measured.

representative property

representative: Measurement

The min-peak run — the one the headline peak/allocations/total_bytes come from.

peak_bytes property

peak_bytes: int

The headline peak — the minimum high-water across passes (the fresh-process floor).

peak_bytes_max property

peak_bytes_max: int

The worst peak across repeats (equals peak_bytes with one repeat).

allocations property

allocations: int

Allocation count from the representative (min-peak) run.

total_bytes property

total_bytes: int

Total bytes allocated by the representative (min-peak) run.

rss_bytes property

rss_bytes: int | None

Headline whole-process RSS: the minimum ru_maxrss across isolated passes (the cold floor, like peak_bytes), or None if memory wasn't measured in isolation (in-process has no attributable process-global RSS).

series

series(field: str) -> list[Any]

The per-repeat values of one series field (SERIES_FIELDS or optional).

as_dict

as_dict() -> dict[str, Any]

The JSON blob stored under pytest-benchmark extra_info["benchmem"].

The three core per-repeat series, flat, plus any OPTIONAL_SERIES_FIELDS that were measured (all-or-nothing per result). No denormalized scalars and no repeats (it's the len of any series). Everything else derives on read.

from_blob classmethod

from_blob(blob: Mapping[str, Any]) -> MemoryResult

Rebuild from a blob's per-repeat series. Core columns are required; any OPTIONAL_SERIES_FIELDS are read when present (else left None).

Measurement dataclass

One repeat's raw numbers — memray's peak high-water, allocation count, and total bytes allocated (cumulative churn, incl. temporaries GC later frees), plus an optional whole-process resident high-water.

rss_bytes is getrusage's ru_maxrss from an isolated pass (a fresh child process); None in-process, where a process-global RSS isn't attributable to the action.

Readers & loader

from_pytest_benchmark reads timing (seconds, from stats); memory_from_pytest_benchmark reads memory (bytes, from extra_info.benchmem). load_samples is the unified reader; load_long_df stacks runs into the tidy frame the plots pivot. discover_runs collects saved runs from pytest-benchmark's .benchmarks/ storage, so you can hand the readers a directory instead of listing files.

from_pytest_benchmark

from_pytest_benchmark(
    path: str | Path, *, metric: str = "min"
) -> tuple[str, list[Sample], str]

Read timing out of a pytest-benchmark file → (label, samples, "s").

Dims come from each benchmark's parametrize params and extra_info, plus the structural node.* dims (see _node_dims).

Parameters:

Name Type Description Default
path str | Path

A pytest-benchmark JSON file.

required
metric str

Which pytest-benchmark stat to read (min / median / …).

'min'

Returns:

Type Description
tuple[str, list[Sample], str]

(label, samples, unit): the run label, one Sample per benchmark, and the unit ("s").

memory_from_pytest_benchmark

memory_from_pytest_benchmark(
    path: str | Path,
    *,
    field: str = "peak_bytes",
    reduce: Callable[[list[float]], float] | None = None,
) -> tuple[str, list[Sample], str]

Read memory out of a pytest-benchmark file → (label, samples, unit).

The benchmark_memory fixture stores each run's memory blob under extra_info["benchmem"] (a flat per-repeat series per field), keyed by the same benchmark id pytest-benchmark uses. Benchmarks lacking the blob (timing-only tests) are skipped. Dims come from parametrize params and extra_info, plus the structural node.* dims (see _node_dims).

Parameters:

Name Type Description Default
path str | Path

A pytest-benchmark JSON file.

required
field str

Which series to read — peak_bytes (unit B), allocations (count), or total_bytes.

'peak_bytes'
reduce Callable[[list[float]], float] | None

Reduce the per-repeat series to one scalar. Default (None) derives the headline (peak = min, allocations/total_bytes = the min-peak run); pass a callable for a distribution stat over the series instead.

None

Returns:

Type Description
tuple[str, list[Sample], str]

(label, samples, unit): the run label, one Sample per benchmark with the blob, and the unit (B or count).
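For orientation, here is a stdlib-only sketch of the kind of read this function performs, assuming pytest-benchmark's saved JSON layout (a top-level "benchmarks" list whose entries carry "fullname", "params", and "extra_info"). `read_peaks` is a hypothetical stand-in: it covers only the headline peak, takes the label from the file stem as an assumption, and omits the extra_info and node.* dims.

```python
import json
import tempfile
from pathlib import Path


def read_peaks(path):
    """Return (label, samples, unit) with the headline peak per benchmark."""
    path = Path(path)
    data = json.loads(path.read_text())
    samples = []
    for bench in data["benchmarks"]:
        blob = (bench.get("extra_info") or {}).get("benchmem")
        if blob is None:
            continue  # timing-only benchmark: no memory blob, so skip it
        dims = dict(bench.get("params") or {})
        samples.append((bench["fullname"], min(blob["peak_bytes"]), dims))
    return path.stem, samples, "B"


# A tiny hand-written run file, just to exercise the reader.
run = {
    "benchmarks": [
        {
            "fullname": "test_build[100]",
            "params": {"n": 100},
            "extra_info": {
                "benchmem": {
                    "peak_bytes": [800000, 805000],
                    "allocations": [12, 12],
                    "total_bytes": [800000, 805000],
                }
            },
        },
        {"fullname": "test_timing_only", "params": None, "extra_info": {}},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    run_path = Path(tmp) / "0001_baseline.json"
    run_path.write_text(json.dumps(run))
    label, samples, unit = read_peaks(run_path)
```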

load_samples

load_samples(
    path: str | Path,
    *,
    metric: Metric = "time",
    stat: str | None = None,
) -> tuple[str, list[Sample], str]

Read one pytest-benchmark file for the chosen metric → (label, samples, unit).

The unified reader over from_pytest_benchmark (timing) and memory_from_pytest_benchmark (memory).

Parameters:

Name Type Description Default
path str | Path

A pytest-benchmark JSON file.

required
metric Metric

Which metric to read (time / peak / allocated / allocations).

'time'
stat str | None

Distribution stat over the metric's per-repeat series (min / max / mean / median / stddev); None reads the headline scalar. For time it selects the pytest-benchmark stat (default min).

None

Returns:

Type Description
tuple[str, list[Sample], str]

(label, samples, unit) — the run label, its samples, and the metric's unit.

load_long_df

load_long_df(
    runs: str | Path | Sequence[str | Path],
    *,
    metric: Metric = "time",
    stat: str | None = None,
    labels: Sequence[str] | None = None,
) -> tuple[pd.DataFrame, str]

Stack pytest-benchmark files (one path or a sequence) into one long frame → (df, unit).

One row per (run, id) for the chosen metric. Columns: snapshot (the series/version label), id, value, then one column per dim key seen (missing dims are NaN). Every plot view pivots this frame.

Parameters:

Name Type Description Default
runs str | Path | Sequence[str | Path]

One path or a sequence of pytest-benchmark JSON files.

required
metric Metric

Which metric to read (time / peak / allocated / allocations).

'time'
stat str | None

Distribution stat over the per-repeat series; None reads the headline scalar.

None
labels Sequence[str] | None

Overrides the snapshot label per run (one per path, same order), decoupling the display name from the filename; defaults to each file's stem.

None

Returns:

Type Description
tuple[DataFrame, str]

(df, unit) — the long-form frame and the metric's unit.
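The long-frame layout can be pictured without pandas. The sketch below stacks per-run samples into plain row dicts with the columns named above, filling dims a benchmark lacks with NaN; `stack_long` and the sample data are hypothetical, and the real function returns a DataFrame rather than a list.

```python
import math


def stack_long(runs: dict) -> list[dict]:
    """runs maps a snapshot label to [(id, value, dims), ...]."""
    # Every dim key seen in any run becomes a column.
    dim_keys = sorted(
        {key for samples in runs.values() for _, _, dims in samples for key in dims}
    )
    rows = []
    for snapshot, samples in runs.items():
        for bench_id, value, dims in samples:
            row = {"snapshot": snapshot, "id": bench_id, "value": value}
            for key in dim_keys:
                row[key] = dims.get(key, math.nan)  # missing dims are NaN
            rows.append(row)
    return rows


rows = stack_long(
    {
        "v1": [("test_build[100]", 800000, {"n": 100})],
        "v2": [
            ("test_build[100]", 790000, {"n": 100}),
            ("test_write", 120000, {"fmt": "csv"}),
        ],
    }
)
```

Each row is one (run, id) pair, so a comparison view only has to pivot on `snapshot`.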

discover_runs

discover_runs(
    root: str | Path = ".benchmarks",
) -> list[Path]

Return pytest-benchmark JSON files under root (for CLI suggestions).

Parameters:

Name Type Description Default
root str | Path

Directory to search (default: pytest-benchmark's .benchmarks store).

'.benchmarks'

Returns:

Type Description
list[Path]

The JSON file paths found under root.

Sample

Bases: NamedTuple

One measured result: an opaque id, a value, and analysis dims.

Plotting — pytest_benchmem.plotting

Every plot_* returns (figure, n_ids). snapshots is a list of run JSON paths; labels names the series per run (defaults to the file stems) — the API behind plot's -l/--label. plot_compare's sort is "absolute" (native units) or "relative" (percent).

plot_scaling

plot_scaling(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    x: str | None = None,
    color: str | None = None,
    facet: str | None = None,
    log: bool | Literal["auto"] = "auto",
    band: Literal["auto", "minmax", "none"] = "auto",
    where: Mapping[str, str] | None = None,
    free_axes: FreeAxes | None = None,
    labels: Sequence[str] | None = None,
) -> tuple[Figure, int]

Cost vs a numeric dim, coloured/faceted by other dims.

x/color/facet default to inference from the dims (the lone numeric dim → x); pass them to override.

Parameters:

Name Type Description Default
snapshots Snapshots

Run JSON path(s).

required
metric Metric

Which metric to plot (time / peak / allocated / allocations).

'time'
x str | None

Dim for the x-axis (default: the lone numeric dim).

None
color str | None

Dim to colour by (default: inferred).

None
facet str | None

Dim to split into subplots (default: inferred).

None
log bool | Literal['auto']

"auto" log-scales when x is numeric and strictly positive; or force with a bool.

'auto'
band Literal['auto', 'minmax', 'none']

Spread whiskers (minmax of each point's per-pass series) on memory metrics — "auto" shows them where there's spread, "minmax" forces them on, "none" off. The line stays the headline (the min floor); whiskers reach up to the worst pass. Ignored for time.

'auto'
where Mapping[str, str] | None

Keep only rows matching these dim=value pairs.

None
free_axes FreeAxes | None

"x" / "y" / "both" — unmatch a faceted axis from the shared default ("x" for incommensurable sweeps, "y" when facets have different cost scales, e.g. per function).

None
labels Sequence[str] | None

Names the snapshot in the title (default: file stem).

None

Returns:

Type Description
tuple[Figure, int]

(figure, n_ids) — the plotly figure and the number of ids plotted.
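Two of the rules described for plot_scaling, the "auto" log decision and the min floor with a whisker up to the worst pass, can be sketched in plain Python. Both function names are hypothetical and the empty-input behaviour is an assumption; this illustrates the documented rules and is not the plotting code itself.

```python
from numbers import Real


def should_log_scale(xs, log="auto"):
    """Hypothetical: decide whether the x-axis is log-scaled."""
    if log != "auto":
        # An explicit bool forces the choice.
        return bool(log)
    values = list(xs)
    # "auto": every x must be numeric and strictly positive.
    return bool(values) and all(
        isinstance(v, Real) and not isinstance(v, bool) and v > 0
        for v in values
    )


def floor_and_whisker(passes):
    """Hypothetical: (headline, whisker top) for one point's per-pass series."""
    # The line follows the min floor; the whisker reaches the worst pass.
    return min(passes), max(passes)
```

With passes of 120, 100 and 110 bytes, the point sits at 100 and its whisker reaches 120.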

plot_scatter

plot_scatter(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    facet: str | None = None,
    clip: float | None = None,
    where: Mapping[str, str] | None = None,
    free_axes: FreeAxes | None = None,
    labels: Sequence[str] | None = None,
) -> tuple[Figure, int]

Baseline cost (log-x) vs candidate/baseline ratio (log-y).

Points in the top-right are slow at baseline and slower in the candidate (the regressed corner). The first snapshot is the baseline; with three or more snapshots, the later ones animate. Colour encodes the absolute Δ.

Parameters:

Name Type Description Default
snapshots Snapshots

Run JSON path(s); the first is the baseline, extras animate.

required
metric Metric

Which metric to plot (time / peak / allocated / allocations).

'time'
facet str | None

Dim to split into subplots.

None
clip float | None

Clamp the colour scale (default p95).

None
where Mapping[str, str] | None

Keep only rows matching these dim=value pairs.

None
free_axes FreeAxes | None

Give each facet its own axes instead of sharing.

None
labels Sequence[str] | None

Series names per run (default: file stems).

None

Returns:

Type Description
tuple[Figure, int]

(figure, n_ids) — the plotly figure and the number of ids plotted.
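The quantities behind the scatter can be sketched as follows: for each id present in both runs, the baseline cost gives x, the candidate/baseline ratio gives y, and the absolute Δ gives the colour. The function name scatter_points and the choice to skip ids that are missing or have a non-positive baseline are assumptions for illustration.

```python
from __future__ import annotations


def scatter_points(
    baseline: dict[str, float],
    candidate: dict[str, float],
) -> dict[str, tuple[float, float, float]]:
    """Hypothetical: id -> (baseline cost, ratio, absolute delta)."""
    points = {}
    for bench_id, base in baseline.items():
        if bench_id not in candidate or base <= 0:
            # Log axes need a positive baseline; unmatched ids are dropped.
            continue
        cand = candidate[bench_id]
        points[bench_id] = (base, cand / base, cand - base)
    return points
```

A ratio above 1 means the candidate costs more than the baseline, so a point with a large baseline and a ratio above 1 lands in the regressed corner.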

plot_compare

plot_compare(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    sort: SortMode = "absolute",
    facet: str | None = None,
    clip: float | None = None,
    where: Mapping[str, str] | None = None,
    free_axes: FreeAxes | None = None,
    labels: Sequence[str] | None = None,
) -> tuple[Figure, int]

Bar chart of per-id delta, sorted by the chosen Δ (biggest regressions on top).

The first two snapshots are compared; the first is the baseline.

Parameters:

Name Type Description Default
snapshots Snapshots

Run JSON path(s); only the first two are used.

required
metric Metric

Which metric to plot (time / peak / allocated / allocations).

'time'
sort SortMode

absolute plots b - a in the native unit, where a is the baseline (first snapshot) and b is the second; relative plots the percent change from a to b.

'absolute'
facet str | None

Dim to split into subplots.

None
clip float | None

Clamp the colour scale (default symmetric p95).

None
where Mapping[str, str] | None

Keep only rows matching these dim=value pairs.

None
free_axes FreeAxes | None

Give each facet its own axes instead of sharing.

None
labels Sequence[str] | None

Series names for the two runs (default: file stems).

None

Returns:

Type Description
tuple[Figure, int]

(figure, n_ids) — the plotly figure and the number of ids plotted.
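The two sort modes differ only in how the per-id delta is computed. The sketch below shows one way to derive and order those deltas from two id-to-value mappings; the function name sorted_deltas and the handling of ids missing from either run are assumptions, not the package's implementation.

```python
from __future__ import annotations


def sorted_deltas(
    a: dict[str, float],
    b: dict[str, float],
    sort: str = "absolute",
) -> list[tuple[str, float]]:
    """Hypothetical: per-id delta of b against baseline a, biggest first."""
    rows = []
    for bench_id in sorted(a.keys() & b.keys()):
        base, cand = a[bench_id], b[bench_id]
        if sort == "relative":
            if base == 0:
                # Percent change is undefined for a zero baseline.
                continue
            value = (cand - base) * 100.0 / base
        else:
            value = cand - base
        rows.append((bench_id, value))
    # Largest positive delta (worst regression) comes first.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows
```

The ordering can flip between modes: a small benchmark that grows by half ranks first under "relative" even when a large benchmark gained more in native units.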

plot_sweep

plot_sweep(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    clip: float | None = None,
    where: Mapping[str, str] | None = None,
    labels: Sequence[str] | None = None,
) -> tuple[Figure, int]

Heatmap of per-id fold-change (log2 ratio) vs the first snapshot.

Parameters:

Name Type Description Default
snapshots Snapshots

Run JSON paths; columns in order, the first is the reference.

required
metric Metric

Which metric to plot (time / peak / allocated / allocations).

'time'
clip float | None

Clamp the colour scale.

None
where Mapping[str, str] | None

Keep only rows matching these dim=value pairs.

None
labels Sequence[str] | None

Column (version) names (default: file stems).

None

Returns:

Type Description
tuple[Figure, int]

(figure, n_ids) — the plotly figure and the number of ids plotted.
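The heatmap cells are log2 ratios against the first snapshot, so 0 means unchanged, +1 means doubled, and -1 means halved. A minimal sketch of that calculation follows; the function name log2_fold_changes and the use of NaN for missing or non-positive values are assumptions for illustration.

```python
from __future__ import annotations

import math


def log2_fold_changes(runs: list[dict[str, float]]) -> dict[str, list[float]]:
    """Hypothetical: id -> one log2(value / reference) per run, in order."""
    reference = runs[0]
    table = {}
    for bench_id, ref in reference.items():
        if ref <= 0:
            # A non-positive reference has no defined ratio.
            continue
        row = []
        for run in runs:
            value = run.get(bench_id)
            if value is None or value <= 0:
                row.append(math.nan)  # assumed placeholder for a gap
            else:
                row.append(math.log2(value / ref))
        table[bench_id] = row
    return table
```

A log scale keeps speedups and slowdowns symmetric around zero, which is why a diverging colour scale suits this kind of chart.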

Sweeps — pytest_benchmem.sweep

See Cross-version sweeps for the narrative, the Venv object, and the provision parameters.

sweep

sweep(
    versions: Sequence[str],
    run: Callable[[Venv], None],
    **provision_kwargs: object,
) -> list[str]

Provision a venv per version and call run(venv) in each.

run does whatever the consumer needs (invoke pytest / a memory command with venv.python and cwd=venv.cwd). Returns the list of versions that failed to provision.
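A run callback might look like the sketch below, which invokes pytest with the venv's interpreter from the venv's working directory. StandInVenv is a hypothetical placeholder that carries only the two attributes named above (python and cwd); the real Venv object is described in Cross-version sweeps and may hold more. The flag choice and check=False are likewise illustrative.

```python
from __future__ import annotations

import subprocess
from dataclasses import dataclass


@dataclass
class StandInVenv:
    """Hypothetical placeholder for the Venv passed to run."""

    python: str  # path to the venv's interpreter
    cwd: str  # directory to run the command from


def pytest_command(venv) -> list[str]:
    """Build the command line for one version's benchmark run."""
    return [str(venv.python), "-m", "pytest", "--benchmark-memory"]


def run(venv) -> None:
    """Callback shape expected by sweep: takes the venv, returns nothing."""
    # check=False: deciding what a failed test run means is left to the caller.
    subprocess.run(pytest_command(venv), cwd=venv.cwd, check=False)
```

A callback of this shape would be passed as the second argument to sweep, alongside the list of version strings to provision.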