Reference¶
Every flag, marker, fixture, CLI command, and public function. The CLI and Python API below are rendered live from the source; the pytest surface (flags, marker, fixture, blob schema) is curated here. For the narrative versions see Quickstart, Choosing a metric, Grouping by dims, and Compare & gate CI.
pytest command-line flags¶
The plugin adds these to any pytest run (alongside pytest-benchmark's own flags). This
table is generated from the plugin's own --help text, so it can't drift from the code:
| Flag | Default | What |
|---|---|---|
| --benchmark-memory | off | Record peak memory (a memray pass) for every benchmark() call, not just the benchmark_memory fixture, with no test changes. Off by default; the fixture is always measured, with or without this flag. |
| --benchmark-memory-repeats=N | (none) | Force a fixed number of memray passes per benchmark, suite-wide; the reported peak is the min across them. Overridden per-test by @pytest.mark.benchmem(repeats=N). Default: adaptive, running passes until the min floor settles (≥2, cap 10). Set this for a fixed, reproducible count (e.g. CI gating against a saved baseline). |
| --benchmark-memory-warmup=N | 1 | Untracked dry-runs of the action before measuring, suite-wide, to shed one-time costs (lazy imports, first-touch caches) so the measured passes aren't inflated by cold start. Overridden per-test by @pytest.mark.benchmem(warmup=N). Default: 1; set 0 to disable. |
| --benchmark-memory-max-time=SECONDS | (none) | Wall-clock budget for the adaptive memory passes (the analogue of --benchmark-max-time): caps how long adaptive sampling spends per benchmark. Ignored when --benchmark-memory-repeats forces a fixed count. Default: no time bound; the pass cap alone bounds it. |
| --benchmark-memory-compare=REF | off | Compare this run's peak memory against a prior saved run (a pytest-benchmark storage ref like 0001, or the latest if no value is given); folds base and delta-peak columns into the table. |
| --benchmark-memory-compare-fail=FIELD:THRESHOLD | (none) | Fail the session on a memory regression, e.g. peak:10%, peak:5MiB, allocations:5% (repeatable). Fields: peak, allocated, allocations, rss (rss needs isolated runs). Implies --benchmark-memory-compare. |
| --benchmark-memory-profile=DIR | (none) | Save the memray profile (.bin) into DIR, ready for memray flamegraph (or tree/summary). Scope follows the gate: with --benchmark-memory-compare-fail only the regressing ids, otherwise every measured benchmark. Off by default (disk cost). |
| --benchmark-memory-profile-native | off | Capture native (C/C++/Rust) stacks in the kept profile, so the flamegraph attributes memory inside extension code (polars/numpy/solver bindings) instead of one opaque ??? at ??? bucket. Only affects --benchmark-memory-profile runs; opt-in (slower, bigger .bin). Per-test override: @pytest.mark.benchmem(profile_native=True). Off by default. |
| --benchmark-memory-table | combined | Layout for the memory metrics: combined (default) folds them into pytest-benchmark's timing table; split prints a separate memory table. |
| --benchmark-memory-columns=peak,allocated,allocations,rss | (none) | Which memory metrics the table shows, comma-separated and in order: peak, allocated, allocations, rss (rss only shows for isolated runs). Default: peak only. |
| --benchmark-memory-stats=min,mean,max | (none) | With repeats > 1, the stats each shown metric spreads into: min, mean, max, median, stddev. A single pass stays one column. Default: min,mean,max. |
Timing regressions still use pytest-benchmark's own --benchmark-compare /
--benchmark-compare-fail; the --benchmark-memory-compare* flags are the memory
mirror. Their baseline comes from pytest-benchmark's storage (.benchmarks/) — save
one first with --benchmark-save=NAME or --benchmark-autosave, or the gate finds
nothing and passes. See Gate CI on a regression.
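To make the FIELD:THRESHOLD notation concrete, here is a minimal sketch of how such a spec could be parsed and applied to a baseline/current pair. The helper names `parse_threshold` and `regressed` are hypothetical and not part of the plugin, and the rule used (fail when the increase over the baseline exceeds the threshold) is an assumption about the semantics, not a description of the plugin's internals.

```python
import re

# Binary size units accepted in absolute thresholds such as "5MiB" (assumed rules).
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def parse_threshold(spec):
    """Split 'FIELD:THRESHOLD' into (field, kind, amount).

    kind is 'percent' for '10%' and 'absolute' for '5MiB' or a bare number.
    """
    field, _, raw = spec.partition(":")
    if not field or not raw:
        raise ValueError(f"expected FIELD:THRESHOLD, got {spec!r}")
    if raw.endswith("%"):
        return field, "percent", float(raw[:-1])
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(B|KiB|MiB|GiB)?", raw)
    if match is None:
        raise ValueError(f"bad threshold {raw!r}")
    number, unit = match.groups()
    return field, "absolute", float(number) * _UNITS[unit or "B"]


def regressed(base, current, kind, amount):
    """True when current exceeds base by more than the threshold (assumed rule)."""
    increase = current - base
    if kind == "percent":
        # Cross-multiplied so the comparison stays exact for whole-number inputs.
        return increase * 100 > amount * base
    return increase > amount
```

Under these assumptions, `peak:10%` tolerates a peak up to 10% above the baseline, while `peak:5MiB` tolerates a fixed 5 MiB increase regardless of the baseline's size.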
The benchmem marker¶
| Kwarg | Default | What |
|---|---|---|
| repeats | auto | force a fixed N memray passes for this test (default: adaptive, see below). Every pass is kept (the blob stores the whole series); the headline peak is the minimum across them, and --stat reports any other. Overrides the suite-wide --benchmark-memory-repeats. |
| warmup | 1 | untracked dry-runs of the action before measuring, to shed one-time costs (lazy imports, first-touch caches). 0 disables. Overrides the suite-wide --benchmark-memory-warmup. |
| isolate | False | run each memray pass in a fresh process and also record whole-process resident memory as the rss metric: the physical/OOM-relevant peak memray's logical heap can't give. Per-test only (no suite-wide flag): rss is a whole-job capacity number, meaningful only for build+operate benchmarks, so you mark the specific ones. Needs a top-level, picklable benchmarked function (see the whole-job warning below). |
| profile_native | False | on the --benchmark-memory-profile path, capture native (C/C++/Rust) stacks in the kept .bin, so a flamegraph attributes extension-code memory (polars/numpy/solver bindings) instead of one opaque ??? at ??? bucket. Opt-in (slower, bigger .bin). Overrides the suite-wide --benchmark-memory-profile-native. |
| max_peak | (none) | fail the test if the headline peak exceeds this absolute ceiling. A size string ("100MiB", units B/KiB/MiB/GiB) or a bare int (bytes). |
| max_allocated | (none) | as max_peak, on allocated (total bytes). |
| max_allocations | (none) | as above, on the allocations count: a bare number (no unit). |
Isolated rss measures the whole job — build the state inside the callable
The rss metric (isolate=True) runs the action in a fresh, empty process. Two
consequences:
The build must happen inside the measured callable, and the callable must be a top-level,
picklable function. The child starts with nothing, so it must construct whatever it
operates on; and spawn serializes the call with standard pickle, so a lambda or closure
is rejected (we don't use cloudpickle) — pass a module-level function plus lightweight args.
```python
from functools import partial

# ✅ ships only the spec (~bytes); the child builds + writes cold = the whole job's RSS
benchmark_memory(build_and_write, spec, n)

# ❌ a lambda/closure is rejected; std pickle can't serialize it (even build-inside)
benchmark_memory(lambda: write(build(spec, n)))

# ❌ a top-level partial over a *pre-built* model pickles fine, but ships the model and
# measures *deserializing* it, not building it; the build never re-runs in the child
model = build(spec, n)
benchmark_memory(partial(write, model))
```
You can't isolate a single sub-phase. Since the child must build before it can operate,
isolated rss is a build-plus-operate capacity number by construction, never a per-phase
figure (e.g. write-only). For per-phase memory, use the in-process peak metric, which
can measure a write given an already-built model. So the rule is two-part: use a
top-level function (no lambdas), and don't pass heavy pre-built state — build it inside.
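The pickling constraint can be checked ahead of time with the standard library alone. The sketch below uses a hypothetical helper, `is_picklable`, and `math.factorial` as a stand-in for a module-level benchmarked function; it only shows what standard pickle accepts and says nothing about how the plugin itself validates the callable.

```python
import math
import pickle
from functools import partial


def is_picklable(obj):
    """Return True if standard pickle can serialize obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# A module-level function pickles by reference (its importable name).
top_level_ok = is_picklable(math.factorial)

# A partial over a module-level function plus lightweight args also pickles.
partial_ok = is_picklable(partial(math.factorial, 5))

# A lambda has no importable name, so standard pickle rejects it.
lambda_ok = is_picklable(lambda: math.factorial(5))
```

A check like this catches the lambda/closure case early, but it cannot catch the second pitfall: a partial over heavy pre-built state pickles without complaint and still measures the wrong thing.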
Absolute ceilings — max_peak / max_allocated / max_allocations¶
```python
import pytest

@pytest.mark.benchmem(max_peak="100MiB", max_allocations=5000)
def test_build(benchmark_memory):
    benchmark_memory(build_model, 1000)  # build_model: your function under test
```
A baseline-free guardrail: the test fails if the measured metric exceeds the
ceiling (test_build: peak 117 MiB exceeds max_peak 100 MiB). Thresholds are absolute
only — there's no saved run to take a percent of; for relative gating against a prior run
use --benchmark-memory-compare-fail or benchmem compare --fail-on. A ceiling is a
worst-case budget, so with repeats > 1 (including adaptive sampling) the gate reads the
worst pass — not the headline min — and fails if any pass breaches it; the two coincide
for a single pass. The ceiling is enforced wherever memory is measured — the benchmark_memory
fixture and the --benchmark-memory patch — but a plain benchmark() call without
--benchmark-memory measures no memory, so the marker is a no-op there.
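The worst-pass rule is easy to get backwards, so here is a small sketch of it. `parse_size` and `breaches_ceiling` are hypothetical helpers written for this illustration (the size-string grammar is assumed from the units listed above), and the per-pass peaks are made-up numbers.

```python
from __future__ import annotations

import re

_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def parse_size(value: int | str) -> int:
    """Convert a bare int (bytes) or a size string like '100MiB' to bytes."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(B|KiB|MiB|GiB)\s*", value)
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])


def breaches_ceiling(peaks_per_pass: list[int], ceiling: int | str) -> bool:
    """A ceiling is a worst-case budget: compare the worst pass, not the min."""
    return max(peaks_per_pass) > parse_size(ceiling)


# Hypothetical per-pass peaks (bytes): the headline min is under 100 MiB,
# but one pass went over, so the ceiling check still fails.
peaks = [96 * 1024**2, 117 * 1024**2, 98 * 1024**2]
headline = min(peaks)
```

With these numbers the headline peak is 96 MiB, yet the 117 MiB pass trips a 100 MiB ceiling, which matches the failure message quoted above.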
Scope: the benchmarked action only
This gates the benchmarked action only (the isolated call pytest-benchmem
measures), not the whole test. For a whole-test limit or leak check, use
pytest-memray's limit_memory / limit_leaks markers;
see the README's "With pytest-memray".
How many passes? By default pytest-benchmem samples adaptively — after an untracked warmup
run, it runs the memray pass until the min floor settles (≥2 passes; capped at 10, or a
--benchmark-memory-max-time budget). Deterministic code settles in ~3 passes; noisy code runs
more. Set repeats=N (marker) or --benchmark-memory-repeats=N (suite) to force a fixed,
reproducible count — what CI gating against a saved baseline wants. Full rationale and the
noisy-workload guidance are in the guide: Repeats & adaptive sampling.
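The stopping rule can be sketched in a few lines. This is an illustration under stated assumptions, not the plugin's code: `sample_adaptively` and `measure_once` are hypothetical names, the bounds of 2 and 10 come from the text above, and the `patience` default of 2 is a guess (the real default lives in `_ADAPTIVE_PATIENCE`). The optional time budget is left out.

```python
from __future__ import annotations

from collections.abc import Callable


def sample_adaptively(
    measure_once: Callable[[], int],
    *,
    min_passes: int = 2,
    max_passes: int = 10,
    patience: int = 2,  # assumed value, for illustration only
) -> list[int]:
    """Run passes until the running min stops improving.

    Stops once at least min_passes have run and the last `patience` passes
    produced no new minimum, or when max_passes is reached.
    """
    samples: list[int] = []
    best: int | None = None
    stale = 0  # consecutive passes without a new minimum
    while len(samples) < max_passes:
        value = measure_once()
        samples.append(value)
        if best is None or value < best:
            best, stale = value, 0
        else:
            stale += 1
        if len(samples) >= min_passes and stale >= patience:
            break
    return samples
```

With these assumed defaults, a perfectly deterministic action stops after three passes, while a series that keeps finding new lows runs until the cap.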
The benchmark_memory fixture¶
Depends on pytest-benchmark's benchmark fixture; measures peak in a separate untimed
pass, then times via pytest-benchmark.
Order — memory first (cold), then timing
Every call form runs the memray pass first, then pytest-benchmark's timing
(calibration + all rounds). This matters: timing runs the function thousands of times,
which grows and fragments the allocator's arenas — so measuring memory after timing
would report the warm plateau, not the fresh-process floor the headline min is meant
to be. Memory-first measures the cold cost (the warmup pass still sheds the one-time
cold-start within it); timing then runs cleanly, with no memray hooks active. This holds
for __call__, pedantic, and the --benchmark-memory patch alike. The standalone
measure_peak / measure_memory have no timing phase at all; warmup=0 skips the warmup,
repeats=N forces a fixed count.
Explicit control, like pytest-benchmark's pedantic plus a memory pass:
```python
benchmark_memory.pedantic(target, args=(), kwargs=None, setup=None,
                          rounds=1, warmup_rounds=0, iterations=1)
```
- setup: a callable run untracked before each measured call; if it returns (args, kwargs), those supply the call's arguments. It is used for both the timed rounds and each (adaptive) memory sample, so one setup rebuilds fresh state for both and a stateful action's memory samples stay independent. The same applies to benchmark.pedantic(setup=…) under --benchmark-memory, with no extra changes.
- rounds, warmup_rounds, iterations: as in pytest-benchmark.
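Why a per-call setup matters for a stateful action can be shown without the plugin. The sketch below imitates the setup contract with a hypothetical helper, `call_with_setup`; the names `fresh_state` and `drain` are also invented for the example, and nothing here is measured.

```python
def call_with_setup(target, setup=None, args=(), kwargs=None):
    """Call target once, letting setup supply fresh arguments if it returns any."""
    kwargs = dict(kwargs or {})
    if setup is not None:
        produced = setup()  # runs before the call, outside any measurement
        if produced is not None:
            args, kwargs = produced
    return target(*args, **kwargs)


def fresh_state():
    # Rebuild the input every time so each sample starts from the same state.
    return ([3, 1, 2],), {}


def drain(items):
    """A stateful action: it empties the list it is given."""
    count = 0
    while items:
        items.pop()
        count += 1
    return count


# With setup, every call sees a freshly built list.
with_setup = [call_with_setup(drain, setup=fresh_state) for _ in range(3)]

# Without setup, the second call sees the list the first call already emptied.
shared = [3, 1, 2]
without_setup = [call_with_setup(drain, args=(shared,)) for _ in range(2)]
```

The second list shows the failure mode: only the first sample does real work, so later samples would report a misleadingly small footprint.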
Mostly memory, little timing? There's no memory-only switch — the entry rides
pytest-benchmark's timing. To trim it: --benchmark-min-rounds=1 --benchmark-max-time=0
(no test changes), or pedantic(rounds=1, warmup_rounds=0) for a single call. For pure
memory outside pytest, use measure_peak / measure_memory.
Attributes (available after a call):
| Attribute | What |
|---|---|
| extra_info | pytest-benchmark's per-benchmark dict. Set scalars here to attach analysis dims; the memory blob lands here under the benchmem key. |
| peak_bytes | peak memory (bytes) from the last call, or None before any call. |
| result | the full MemoryResult from the last call, or None. |
The extra_info.benchmem blob¶
Each measured benchmark stores this dict under extra_info["benchmem"] — three flat
per-repeat series, one entry per memray pass. Every reported number (headline peak =
min, any --stat) derives from these on read:
| Key | What |
|---|---|
| peak_bytes | per-repeat high-water of live bytes: the peak metric (headline = min) |
| allocations | per-repeat allocation count: the allocations metric |
| total_bytes | per-repeat total bytes allocated: the allocated metric (the churn that peak hides) |
| rss_bytes | per-repeat whole-process resident high-water (ru_maxrss): the rss metric. Only present under isolate=True (each pass in a fresh process); absent otherwise. |
See Choosing a metric for when to reach for each, and --stat for distributions.
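Because the blob holds only the raw series, any summary is a few lines of plain Python. The sketch below works on a made-up blob and a hypothetical `summarize` helper; the keys of the returned dict are chosen for this example and are not a plugin API.

```python
import statistics

# A hypothetical blob for a benchmark measured over three passes.
blob = {
    "peak_bytes": [1_200_000, 1_050_000, 1_100_000],
    "allocations": [530, 512, 520],
    "total_bytes": [4_800_000, 4_700_000, 4_750_000],
}


def summarize(blob):
    """Derive the headline and spread stats from the per-repeat series."""
    peaks = blob["peak_bytes"]
    best = min(range(len(peaks)), key=peaks.__getitem__)  # index of the min-peak pass
    return {
        "repeats": len(peaks),  # not stored: it's the length of any series
        "peak": peaks[best],  # headline: min across passes
        "peak_max": max(peaks),
        "peak_mean": statistics.mean(peaks),
        "allocations": blob["allocations"][best],  # taken from the same min-peak pass
        "allocated": blob["total_bytes"][best],
    }


summary = summarize(blob)
```

Taking allocations and total bytes from the same pass as the minimum peak keeps the headline numbers a coherent snapshot of one run rather than a mix of several.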
CLI — benchmem¶
Installed with pytest-benchmem[plot]. The full command tree and every option, captured
live from the typer app as it actually renders in a terminal:
Usage: benchmem [OPTIONS] COMMAND [ARGS]...
pytest-benchmem — plot and compare benchmark runs.
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --install-completion Install completion for the current shell. │
│ --show-completion Show completion for the current shell, to copy it or customize the │
│ installation. │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ───────────────────────────────────────────────────────────────────────────────────────╮
│ plot Render an interactive plotly view from one or more pytest-benchmark runs. │
│ compare Print a per-id table for one run, or compare two or more (and optionally gate CI). │
│ sweep Run a benchmark suite across several installed versions of a package. │
│ flamegraph Render a kept memory profile in one step — resolve the ``.bin`` for a test and run │
│ memray. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
Usage: benchmem plot [OPTIONS] RUNS...
Render an interactive plotly view from one or more pytest-benchmark runs.
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
│ * runs... PATH pytest-benchmark JSON file(s). [required] │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --columns [time|peak|allocated|allocation Metric to plot: time | peak | │
│ s|rss] allocated | allocations | rss │
│ (rss = isolated runs only). One │
│ per figure (a plot has a single │
│ value axis) — same flag as │
│ `compare`; the spread shows as │
│ whiskers via --band. │
│ [default: time] │
│ --view TEXT compare | scatter | sweep | │
│ scaling (default: by count). │
│ --facet TEXT Dim to facet by. │
│ --x TEXT scaling: dim for the x-axis. │
│ --clip FLOAT Clamp the colour scale. │
│ --where TEXT Filter rows by dim: KEY=VALUE │
│ (repeatable, AND-combined). │
│ --free-axes [x|y|both] Free facet axes: x | y | both │
│ (needs --facet). │
│ --band [auto|minmax|none] scaling: spread whiskers on │
│ memory metrics — auto | minmax | │
│ none. │
│ [default: auto] │
│ --label -l TEXT Series label per run, in order │
│ (repeat). Default: stem. │
│ --output -o PATH HTML out. │
│ --open --no-open [default: no-open] │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
Usage: benchmem compare [OPTIONS] RUNS...
Print a per-id table for one run, or compare two or more (and optionally gate CI).
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
│ * runs... PATH One or more pytest-benchmark runs, oldest → newest. One prints a plain │
│ table; two or more compare (a sweep is N). │
│ [required] │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --columns TEXT Comma list of metrics: time | peak | allocated | allocations | rss (rss │
│ = isolated runs only; e.g. peak or time,peak,rss). Default: time,peak. │
│ Each is shown across every --stat; a metric absent from every run is │
│ dropped. │
│ --group-by TEXT Group rows into sub-tables: fullname | name | func | group | module | │
│ class | param:NAME (comma-composable). │
│ [default: fullname] │
│ --stat TEXT Which stat column(s) per metric: min | max | mean | median | stddev, or │
│ all (the default) for the full spread side by side. │
│ --sort TEXT Row order: name (id) | value (largest in the last run) | change. │
│ [default: name] │
│ --csv PATH Also write the raw (unscaled) comparison to this CSV file. │
│ --fail-on TEXT Exit non-zero on a regression of the first run vs the last. │
│ FIELD:THRESHOLD, repeatable — e.g. --fail-on peak:10% --fail-on │
│ peak:5MiB --fail-on rss:10% (rss gates only isolated runs). │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
Usage: benchmem sweep [OPTIONS] PACKAGE VERSIONS...
Run a benchmark suite across several installed versions of a package.
Provisions one fresh uv venv per version, runs 'pytest <suite> --benchmark-only'
in each writing <out>/<version>.json, then prints the next step. --memory adds
the memory pass; forward any other pytest flag with --pytest-arg, e.g.
benchmem sweep mypkg 1.2.0 1.3.0 --suite benchmarks/ --memory --pytest-arg=-k.
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
│ * package TEXT Package under test; each plain version installs `<package>==<v>`. │
│ [required] │
│ * versions... TEXT Versions or pip specs to sweep, e.g. 1.2.0 1.3.0 │
│ git+https://github.com/me/pkg@main. │
│ [required] │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ * --suite PATH Benchmark suite (dir or file) to run in each version's │
│ venv. │
│ [required] │
│ --out PATH Directory for the per-version JSON runs. │
│ [default: .benchmarks/sweep] │
│ --memory --no-memory Add --benchmark-memory to each pytest run. │
│ [default: no-memory] │
│ --pytest-arg TEXT Arg forwarded to pytest, one token each, repeatable │
│ (e.g. --pytest-arg=-k). │
│ --pin TEXT Extra pip spec installed alongside (repeatable). │
│ --as-of TEXT YYYY-MM-DD for uv --exclude-newer (reproducible │
│ resolve). │
│ --import-check TEXT Module asserted to resolve to the venv (isolation │
│ preflight). │
│ --copy-dir PATH Directory copied into each venv's cwd (the suite │
│ imports from here). │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
Usage: benchmem flamegraph [OPTIONS] PROFILE_DIR [TEST_ID]
Render a kept memory profile in one step — resolve the ``.bin`` for a test and run memray.
Closes the "regressed → *where*?" loop after ``--benchmark-memory-profile``: instead of
finding the right ``.bin`` and remembering the memray subcommand, point at the profile dir
and name the test (or ``--worst peak`` to auto-pick the heaviest). Defaults to an HTML
flamegraph written next to the ``.bin``; ``--report tree|summary|stats`` prints to the
terminal instead.
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
│ * profile_dir PATH Directory of kept .bin profiles (--benchmark-memory-profile). │
│ [required] │
│ [test_id] TEXT Test id (exact, or a unique substring) to render; omit with --worst. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --worst TEXT Auto-pick the heaviest: peak | allocated | allocations │
│ --report TEXT memray reporter: flamegraph | table | tree | summary | stats │
│ [default: flamegraph] │
│ --native --no-native Require the profile to carry native traces (captured via │
│ --benchmark-memory-profile-native); error if it doesn't. │
│ [default: no-native] │
│ --output -o PATH HTML out path (default: next to the .bin). │
│ --open --no-open Open the rendered HTML. [default: no-open] │
│ --force -f Overwrite an existing render. │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
Public Python API¶
Light to import — pytest_benchmem re-exports only the engine and the readers;
pytest_benchmem.plotting pulls plotly and pytest_benchmem.sweep shells to uv,
so import those submodules directly.
Engine¶
measure_peak ¶
Run action() under memray.Tracker and return peak bytes.
The bare one-liner for a REPL or notebook; measure_memory returns the
full result (allocation count, spread). repeats behaves as there: None
(default) samples adaptively, an int forces a fixed pass count.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| action | Action | The zero-argument callable to measure. | required |
| repeats | int \| None | Fixed pass count, or None to sample adaptively. | None |

Returns:
| Type | Description |
|---|---|
| int | Peak bytes (the headline min across passes). |
measure_memory ¶
measure_memory(
action: Action,
repeats: int | None = None,
*,
warmup: int = _DEFAULT_WARMUP,
isolate: bool = False,
max_time: float | None = None,
min_passes: int = _ADAPTIVE_MIN_PASSES,
max_passes: int = _ADAPTIVE_MAX_PASSES,
patience: int = _ADAPTIVE_PATIENCE,
keep_bin: Path | None = None,
native: bool = False,
setup: Action | None = None,
) -> MemoryResult
Run action() under memray.Tracker → MemoryResult, one pass per repeat.
warmup untracked dry-runs run first to shed one-time costs; then each measured pass
gets a fresh tracker. The headline is the min across passes (see MemoryResult);
every pass's Measurement is kept for spread stats.
With isolate=True each measured pass runs in a fresh spawned process (each warming
itself), and that child's whole-process resident high-water (ru_maxrss) is recorded as
Measurement.rss_bytes, a physical-memory reading attributable to the action, which
an in-process pass can't give. The action (and setup) must be picklable (a top-level
callable, not a lambda/closure); keep_bin is ignored in this mode.
Two modes, by repeats:
- repeats=N (an int): run exactly N passes. Fixed and reproducible; what CI gating and saved-baseline comparisons want.
- repeats=None (default): sample adaptively, running passes until the min stops moving (no new low for patience passes), bounded by min_passes (≥2), max_passes, and an optional max_time budget. Deterministic code settles in a few passes; noisy code runs more.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| action | Action | The zero-argument callable to measure. | required |
| repeats | int \| None | Fixed pass count, or None to sample adaptively. | None |
| warmup | int | Untracked dry-runs before measuring (0 disables). | _DEFAULT_WARMUP |
| isolate | bool | Run each pass in a fresh spawned process and record its ru_maxrss as rss_bytes. | False |
| max_time | float \| None | Wall-clock budget (seconds) for adaptive sampling; None means no time bound. | None |
| min_passes | int | Minimum passes when sampling adaptively. | _ADAPTIVE_MIN_PASSES |
| max_passes | int | Hard ceiling on passes when sampling adaptively. | _ADAPTIVE_MAX_PASSES |
| patience | int | Stop adaptive sampling after this many consecutive passes with no new min. | _ADAPTIVE_PATIENCE |
| keep_bin | Path \| None | If set, keep the first pass's memray profile (.bin); ignored when isolate=True. | None |
| native | bool | Capture native (C/C++/Rust) stacks in the kept .bin. | False |
| setup | Action \| None | Optional zero-arg callable run untracked before each pass (and each warmup run); its allocations are not measured. Use it to rebuild fresh state so a stateful action's samples stay independent. | None |

Returns:
| Type | Description |
|---|---|
| MemoryResult | A MemoryResult. |
MemoryResult
dataclass
¶
A memory measurement across repeats passes, derived from the per-repeat samples.
The per-repeat samples are the single source of truth: that's all the blob
stores (the series); everything else is derived from them on read.
The headline peak_bytes is the minimum peak across passes: the fresh-process
floor, unbiased by the in-process warm plateau (repeated runs fragment/grow arenas and
allocate more) that a central stat would report. allocations / total_bytes
come from that same min-peak run (a coherent snapshot); peak_bytes_max is the worst
peak, so the spread is visible. A warm-plateau / steady-state read is available via the
mean / median --stat. A single pass collapses all of these to its own values.
representative
property
¶
The min-peak run — the one the headline peak/allocations/total_bytes come from.
peak_bytes
property
¶
The headline peak — the minimum high-water across passes (the fresh-process floor).
peak_bytes_max
property
¶
The worst peak across repeats (equals peak_bytes with one repeat).
rss_bytes
property
¶
Headline whole-process RSS: the minimum ru_maxrss across isolated passes
(the cold floor, like peak_bytes), or None if memory wasn't measured in
isolation (in-process has no attributable process-global RSS).
series ¶
The per-repeat values of one series field (SERIES_FIELDS or optional).
as_dict ¶
The JSON blob stored under pytest-benchmark extra_info["benchmem"].
The three core per-repeat series, flat, plus any OPTIONAL_SERIES_FIELDS
that were measured (all-or-nothing per result). No denormalized scalars and no
repeats (it's len of any series). Everything else derives on read.
from_blob
classmethod
¶
Rebuild from a blob's per-repeat series. Core columns are required; any
OPTIONAL_SERIES_FIELDS are read when present (else left None).
Measurement
dataclass
¶
One repeat's raw numbers — memray's peak high-water, allocation count, and total bytes allocated (cumulative churn, incl. temporaries GC later frees), plus an optional whole-process resident high-water.
rss_bytes is getrusage's ru_maxrss from an isolated pass (a fresh child
process); None in-process, where a process-global RSS isn't attributable to the action.
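For orientation, this is how ru_maxrss can be read with the standard library on Unix. One detail worth knowing when comparing numbers across machines: the kernel reports it in kilobytes on Linux and in bytes on macOS. The helper name `max_rss_bytes` is hypothetical, and whether and how the plugin normalizes the unit is not stated in this reference, so treat the conversion as this sketch's own.

```python
import resource
import sys


def max_rss_bytes():
    """Peak resident set size of the current process, in bytes (Unix only).

    ru_maxrss is a high-water mark: it never decreases during a process's life,
    which is why an attributable reading needs a fresh process per pass.
    """
    raw = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports kilobytes; macOS reports bytes.
    return raw if sys.platform == "darwin" else raw * 1024


before = max_rss_bytes()
payload = bytearray(8 * 1024**2)  # touch some memory
after = max_rss_bytes()
```

Since the value is process-wide and monotonic, `after` can only be greater than or equal to `before`, whatever the action did in between.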
Readers & loader¶
from_pytest_benchmark reads timing (seconds, from stats);
memory_from_pytest_benchmark reads memory (bytes, from extra_info.benchmem).
load_samples is the unified reader; load_long_df stacks runs into the tidy frame the
plots pivot. discover_runs collects saved runs from pytest-benchmark's .benchmarks/
storage, so you can hand the readers a directory instead of listing files.
from_pytest_benchmark ¶
Read timing out of a pytest-benchmark file → (label, samples, "s").
Dims come from each benchmark's parametrize params and extra_info, plus
the structural node.* dims (see _node_dims).

Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | A pytest-benchmark JSON file. | required |
| metric | str | Which pytest-benchmark stat to read. | 'min' |

Returns:
| Type | Description |
|---|---|
| tuple | (label, samples, "s"): the run label (str), the samples (list[Sample]), and the unit ("s", seconds). |
memory_from_pytest_benchmark ¶
memory_from_pytest_benchmark(
path: str | Path,
*,
field: str = "peak_bytes",
reduce: Callable[[list[float]], float] | None = None,
) -> tuple[str, list[Sample], str]
Read memory out of a pytest-benchmark file → (label, samples, unit).
The benchmark_memory fixture stores each run's memory blob under
extra_info["benchmem"] (a flat per-repeat series per field), keyed by the same
benchmark id pytest-benchmark uses. Benchmarks lacking the blob (timing-only tests)
are skipped. Dims come from parametrize params and extra_info, plus the
structural node.* dims (see _node_dims).
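If you only need the raw series and not the reader's Sample objects, the file can also be walked with the json module. The sketch below relies on pytest-benchmark's JSON layout (a top-level benchmarks list whose entries carry fullname and extra_info); `read_memory_series` is a hypothetical helper, and the run file is a heavily trimmed, made-up example.

```python
import json
import tempfile
from pathlib import Path


def read_memory_series(path, field="peak_bytes"):
    """Map benchmark fullname -> per-repeat series for one blob field.

    Benchmarks without a benchmem blob (timing-only tests) are skipped.
    """
    data = json.loads(Path(path).read_text())
    out = {}
    for bench in data.get("benchmarks", []):
        blob = bench.get("extra_info", {}).get("benchmem")
        if blob is None or field not in blob:
            continue
        out[bench["fullname"]] = blob[field]
    return out


# A hypothetical, heavily trimmed run file for demonstration.
run = {
    "benchmarks": [
        {
            "fullname": "test_a.py::test_build",
            "extra_info": {"benchmem": {"peak_bytes": [300, 280]}},
        },
        {"fullname": "test_a.py::test_timing_only", "extra_info": {}},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    run_path = Path(tmp) / "0001_run.json"
    run_path.write_text(json.dumps(run))
    series = read_memory_series(run_path)

# Reduce each series to its floor, mirroring the headline-min convention.
floors = {name: min(values) for name, values in series.items()}
```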
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | A pytest-benchmark JSON file. | required |
| field | str | Which series to read: a blob key such as peak_bytes, allocations, or total_bytes. | 'peak_bytes' |
| reduce | Callable[[list[float]], float] \| None | Reduce the per-repeat series to one scalar. | None |

Returns:
| Type | Description |
|---|---|
| tuple[str, list[Sample], str] | (label, samples, unit); samples cover only the benchmarks with the blob. |
load_samples ¶
load_samples(
path: str | Path,
*,
metric: Metric = "time",
stat: str | None = None,
) -> tuple[str, list[Sample], str]
Read one pytest-benchmark file for the chosen metric → (label, samples, unit).
The unified reader over from_pytest_benchmark (timing) and
memory_from_pytest_benchmark (memory).

Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | A pytest-benchmark JSON file. | required |
| metric | Metric | Which metric to read (time, peak, allocated, allocations, or rss). | 'time' |
| stat | str \| None | Distribution stat over the metric's per-repeat series (min, max, mean, median, or stddev). | None |

Returns:
| Type | Description |
|---|---|
| tuple[str, list[Sample], str] | (label, samples, unit). |
load_long_df ¶
load_long_df(
runs: str | Path | Sequence[str | Path],
*,
metric: Metric = "time",
stat: str | None = None,
labels: Sequence[str] | None = None,
) -> tuple[pd.DataFrame, str]
Stack pytest-benchmark files (one path or a sequence) into one long frame → (df, unit).
One row per (run, id) for the chosen metric. Columns: snapshot
(the series/version label), id, value, then one column per dim key seen
(missing dims are NaN). Every plot view pivots this frame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| runs | str \| Path \| Sequence[str \| Path] | One path or a sequence of pytest-benchmark JSON files. | required |
| metric | Metric | Which metric to read (time, peak, allocated, allocations, or rss). | 'time' |
| stat | str \| None | Distribution stat over the per-repeat series. | None |
| labels | Sequence[str] \| None | Overrides the snapshot label per run (default: file stems). | None |

Returns:
| Type | Description |
|---|---|
| tuple[DataFrame, str] | (df, unit). |
discover_runs ¶
Return pytest-benchmark JSON files under root (for CLI suggestions).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| root | str \| Path | Directory to search (default: pytest-benchmark's .benchmarks storage). | '.benchmarks' |

Returns:
| Type | Description |
|---|---|
| list[Path] | The JSON file paths found under root. |
Sample ¶
Bases: NamedTuple
One measured result: an opaque id, a value, and analysis dims.
Plotting — pytest_benchmem.plotting¶
Every plot_* returns (figure, n_ids). snapshots is a list of run JSON paths;
labels names the series per run (defaults to the file stems) — the API behind plot's
-l/--label. plot_compare's sort is "absolute" (native units) or "relative"
(percent).
plot_scaling ¶
plot_scaling(
snapshots: Snapshots,
*,
metric: Metric = "time",
x: str | None = None,
color: str | None = None,
facet: str | None = None,
log: bool | Literal["auto"] = "auto",
band: Literal["auto", "minmax", "none"] = "auto",
where: Mapping[str, str] | None = None,
free_axes: FreeAxes | None = None,
labels: Sequence[str] | None = None,
) -> tuple[Figure, int]
Cost vs a numeric dim, coloured/faceted by other dims.
x/color/facet default to inference from the dims (the lone numeric
dim → x); pass them to override.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| snapshots | Snapshots | Run JSON path(s). | required |
| metric | Metric | Which metric to plot (time, peak, allocated, allocations, or rss). | 'time' |
| x | str \| None | Dim for the x-axis (default: the lone numeric dim). | None |
| color | str \| None | Dim to colour by (default: inferred). | None |
| facet | str \| None | Dim to split into subplots (default: inferred). | None |
| log | bool \| Literal['auto'] | Log scaling: True, False, or 'auto'. | 'auto' |
| band | Literal['auto', 'minmax', 'none'] | Spread whiskers (auto, minmax, or none). | 'auto' |
| where | Mapping[str, str] \| None | Keep only rows matching these dim values. | None |
| free_axes | FreeAxes \| None | Give each facet its own axes instead of sharing. | None |
| labels | Sequence[str] \| None | Names the snapshot in the title (default: file stem). | None |

Returns:
| Type | Description |
|---|---|
| tuple[Figure, int] | (figure, n_ids). |
plot_scatter ¶
plot_scatter(
snapshots: Snapshots,
*,
metric: Metric = "time",
facet: str | None = None,
clip: float | None = None,
where: Mapping[str, str] | None = None,
free_axes: FreeAxes | None = None,
labels: Sequence[str] | None = None,
) -> tuple[Figure, int]
Baseline cost (log-x) vs candidate/baseline ratio (log-y).
Top-right = slow and slower (the regressed corner). The first snapshot is the baseline; with 3+, the rest animate. Colour encodes the absolute Δ.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| snapshots | Snapshots | Run JSON path(s); the first is the baseline, extras animate. | required |
| metric | Metric | Which metric to plot (time, peak, allocated, allocations, or rss). | 'time' |
| facet | str \| None | Dim to split into subplots. | None |
| clip | float \| None | Clamp the colour scale (default p95). | None |
| where | Mapping[str, str] \| None | Keep only rows matching these dim values. | None |
| free_axes | FreeAxes \| None | Give each facet its own axes instead of sharing. | None |
| labels | Sequence[str] \| None | Series names per run (default: file stems). | None |

Returns:
| Type | Description |
|---|---|
| tuple[Figure, int] | (figure, n_ids). |
plot_compare ¶
plot_compare(
snapshots: Snapshots,
*,
metric: Metric = "time",
sort: SortMode = "absolute",
facet: str | None = None,
clip: float | None = None,
where: Mapping[str, str] | None = None,
free_axes: FreeAxes | None = None,
labels: Sequence[str] | None = None,
) -> tuple[Figure, int]
Bar chart of per-id delta, sorted by the chosen Δ (biggest regressions on top).
The first two snapshots are compared; the first is the baseline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| snapshots | Snapshots | Run JSON path(s); only the first two are used. | required |
| metric | Metric | Which metric to plot (time, peak, allocated, allocations, or rss). | 'time' |
| sort | SortMode | "absolute" (native units) or "relative" (percent). | 'absolute' |
| facet | str \| None | Dim to split into subplots. | None |
| clip | float \| None | Clamp the colour scale (default symmetric p95). | None |
| where | Mapping[str, str] \| None | Keep only rows matching these dim values. | None |
| free_axes | FreeAxes \| None | Give each facet its own axes instead of sharing. | None |
| labels | Sequence[str] \| None | Series names for the two runs (default: file stems). | None |

Returns:
| Type | Description |
|---|---|
| tuple[Figure, int] | (figure, n_ids). |
plot_sweep ¶
plot_sweep(
snapshots: Snapshots,
*,
metric: Metric = "time",
clip: float | None = None,
where: Mapping[str, str] | None = None,
labels: Sequence[str] | None = None,
) -> tuple[Figure, int]
Heatmap of per-id fold-change (log2 ratio) vs the first snapshot.
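The heatmap's cell value is simple enough to compute by hand when you want to sanity-check a colour. The sketch below shows the log2 ratio against the first snapshot for one benchmark id; the function name and the version numbers and byte counts are made up for the example.

```python
import math


def log2_fold_change(reference, candidate):
    """log2(candidate / reference): 0 = unchanged, +1 = doubled, -1 = halved."""
    if reference <= 0 or candidate <= 0:
        raise ValueError("fold-change needs positive values")
    return math.log2(candidate / reference)


# Hypothetical peak bytes for one benchmark id across three versions.
peaks = {"1.2.0": 100_000_000, "1.3.0": 200_000_000, "1.4.0": 50_000_000}
reference = peaks["1.2.0"]  # the first snapshot is the reference
row = {version: log2_fold_change(reference, value) for version, value in peaks.items()}
```

The log scale makes regressions and improvements symmetric: a doubling and a halving sit at the same distance from zero, which a plain ratio (2.0 versus 0.5) would not show.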
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| snapshots | Snapshots | Run JSON paths; columns in order, the first is the reference. | required |
| metric | Metric | Which metric to plot (time, peak, allocated, allocations, or rss). | 'time' |
| clip | float \| None | Clamp the colour scale. | None |
| where | Mapping[str, str] \| None | Keep only rows matching these dim values. | None |
| labels | Sequence[str] \| None | Column (version) names (default: file stems). | None |

Returns:
| Type | Description |
|---|---|
| tuple[Figure, int] | (figure, n_ids). |
Sweeps — pytest_benchmem.sweep¶
See Cross-version sweeps for the narrative, the Venv object, and the
provision parameters.
sweep ¶
sweep(
versions: Sequence[str],
run: Callable[[Venv], None],
**provision_kwargs: object,
) -> list[str]
Provision a venv per version and call run(venv) in each.
run does whatever the consumer needs (invoke pytest / a memory command
with venv.python and cwd=venv.cwd). Returns the list of versions
that failed to provision.
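A run callback typically just shells out to pytest inside the venv. The sketch below shows one possible shape. `FakeVenv` is a stand-in defined here that assumes only the two attributes named above (python and cwd), `pytest_command` is a hypothetical helper, and the suite and output paths are placeholders; `--benchmark-only` and `--benchmark-json` are pytest-benchmark's own flags.

```python
import subprocess
from pathlib import Path
from typing import NamedTuple


class FakeVenv(NamedTuple):
    """Stand-in for the real Venv object; only `python` and `cwd` are assumed."""

    python: Path
    cwd: Path


def pytest_command(venv, suite, out_json):
    """Build the pytest invocation for one venv (a list, so no shell quoting)."""
    return [
        str(venv.python),
        "-m",
        "pytest",
        suite,
        "--benchmark-only",
        f"--benchmark-json={out_json}",
    ]


def run(venv):
    """A callback shaped like the `run` argument: one pytest run per venv."""
    cmd = pytest_command(venv, "benchmarks/", Path("run.json"))
    subprocess.run(cmd, cwd=venv.cwd, check=False)


# Building the command does not execute anything, so it is safe to inspect.
venv = FakeVenv(python=Path("/tmp/venv/bin/python"), cwd=Path("/tmp/work"))
cmd = pytest_command(venv, "benchmarks/", Path("out/1.2.0.json"))
```

Invoking pytest as `python -m pytest` with the venv's interpreter is what ties the run to that venv's installed package version.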