Does It Matter How You Pick Your Batches? Sampling Methods in DP-SGD
Introduction
If you have ever trained a machine learning model and heard the phrase "we use Differential Privacy to protect user data," you might have assumed that the math behind it is airtight. In theory, it often is. In practice, however, a surprising number of real-world deployments quietly cut a corner that the theory never accounted for — and the impact on your privacy can be substantial.
This post is about one such corner: how batches of training data are sampled. Specifically, we compare three strategies — Poisson subsampling (what the theory assumes), shuffle-without-replacement (WOR, what most engineers actually implement), and Balls-and-Bins (BB, a recent proposal that tries to get the best of both worlds) — and explain why treating these as interchangeable is a mistake that can silently inflate your real privacy leakage by several times the advertised value.
We also go one step further. Once we have all three samplers on equal footing, we ask: can we squeeze out even more utility without paying extra in privacy? We look at two symmetry-based strategies — data augmentation (creating extra training examples by rotating images) and invariant architecture (training a neural network that treats rotated inputs as equivalent by design) — and compare how all three samplers hold up under each strategy, all held to the same privacy guarantees.
The road map for this post: we start with a quick refresher on differential privacy and DP-SGD, then explain why the gap between Poisson and Shuffle-WOR is a real problem that the field has been too quick to dismiss. From there, we introduce all three sampling schemes side by side, walk through the rotational symmetry strategies we layer on top, explain how we use Privacy Loss Distributions (PLDs) to account for privacy on equal footing, and finally compare results across all three samplers under the same fixed $(\varepsilon, \delta)$ target.
A Quick Refresher: Differential Privacy and DP-SGD
This post assumes you are comfortable with the basics of differential privacy. If not, the PyTorch blog series is a great starting point: Differential Privacy Series Part 1 — DP-SGD Algorithm Explained.
Differential Privacy
Differential Privacy (DP) is a mathematical guarantee that an algorithm's output is nearly indistinguishable whether or not any single individual's data is included. Formally, a mechanism $M$ satisfies $(\varepsilon, \delta)$-DP if for any two neighboring datasets $x$ and $x'$ differing on one record, and any set of outputs $E$:
$$\Pr[M(x) \in E] \leq e^{\varepsilon} \cdot \Pr[M(x') \in E] + \delta$$
The parameters $\varepsilon$ (epsilon) and $\delta$ (delta) quantify the privacy loss — smaller is better. Intuitively: even the best possible classifier trying to distinguish outputs from $M(x)$ vs. $M(x')$ must make a trade-off between its true positive rate (TPR) and false positive rate (FPR) bounded by $\text{FPR} \geq \frac{\text{TPR} - \delta}{e^\varepsilon}$.
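To make this trade-off concrete, here is a tiny sketch (a hypothetical helper of our own, not part of any DP library) that rearranges the bound $\text{TPR} \leq e^\varepsilon \cdot \text{FPR} + \delta$ to give the attacker's minimum false positive rate:

```python
import math

def min_fpr(tpr, eps, delta):
    """Smallest FPR any attacker can achieve at a given TPR under (eps, delta)-DP.

    Rearranged from the hypothesis-testing view: TPR <= e^eps * FPR + delta.
    """
    return max(0.0, (tpr - delta) / math.exp(eps))
```

For example, at $\varepsilon = 1$, $\delta = 10^{-5}$, an attacker that catches half the true cases (`min_fpr(0.5, 1.0, 1e-5)` ≈ 0.184) must also falsely accuse at least ~18% of the innocent ones.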
DP-SGD
DP-SGD [1] is the standard recipe for training neural networks with formal privacy guarantees. At each step it draws a random batch, clips every per-sample gradient to bound its influence, and adds calibrated Gaussian noise before updating the model. The core parameters are:
| Symbol | Role |
|---|---|
| $C$ | Gradient clipping threshold — caps any one record's influence |
| $\sigma$ | Noise multiplier — scales the injected Gaussian noise |
| $q = L/n$ | Sampling rate — fraction of the dataset seen per step |
| $T$ | Total number of training steps |
| $(\varepsilon, \delta)$ | Output privacy cost, computed after all $T$ steps by a privacy accountant |
| $\mathcal{B}$ | Batch sampler ← our focus |
At each training step $t$:

1. **Sample** a mini-batch $S_t$ via sampler $\mathcal{B}$
2. **Compute** per-sample gradients $\mathbf{g}_t(x_i)$
3. **Clip & Noise**
$$ \tilde{\mathbf{g}}_t \leftarrow \frac{1}{L}\left(\sum_i \bar{\mathbf{g}}_t(x_i) + \mathcal{N}(0, \sigma^2 C^2 \mathbf{I})\right) $$
4. **Descend**
$$ \theta_{t+1} \leftarrow \theta_t - \eta_t \tilde{\mathbf{g}}_t $$
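The clip-and-noise step can be sketched in a few lines of plain Python. This is a toy illustration with list-based gradient vectors, not a production implementation; `dp_sgd_step` is a name of our choosing:

```python
import random

def dp_sgd_step(per_sample_grads, clip_norm, noise_multiplier, batch_size):
    """Clip each per-sample gradient to norm C, sum, add N(0, sigma^2 C^2) noise, average."""
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for g in per_sample_grads:
        norm = sum(v * v for v in g) ** 0.5
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0  # rescale so ||g|| <= C
        for j, v in enumerate(g):
            total[j] += v * scale
    noise_std = noise_multiplier * clip_norm  # per-coordinate std is sigma * C
    return [(t + random.gauss(0.0, noise_std)) / batch_size for t in total]
```

The returned vector is the $\tilde{\mathbf{g}}_t$ of the formula above; the optimizer update that follows is pure post-processing.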
Notice that Sample and Clip & Noise are the only two steps that actually create privacy protection. Everything else — gradient computation, the model update — is either pure math or post-processing of an already-noisy signal. Post-processing never weakens a DP guarantee, so it has no bearing on $(\varepsilon, \delta)$ at all. In practice the noise multiplier is often fixed by utility requirements, which leaves the batch sampler as the only remaining degree of freedom, sitting quietly at the top of the loop: how do you pick the batch in the first place?
Problems, Motivation, and Research Focus
Privacy accountants assume Poisson subsampling — the version of DP-SGD where each record independently flips a coin at every step. When used correctly, it comes with a mathematical certificate. Train at noise level $\sigma$, run the privacy accountant, and out comes an $(\varepsilon, \delta)$ bound you can point to. But virtually every real training loop uses shuffle-without-replacement instead, because it is faster, GPU-friendly, and already built into every standard data loader. And when you ask the libraries about this, the answer is telling. Here is what Opacus says in its own documentation:
poisson_sampling (bool) – `True` if you want to use standard sampling required for DP guarantees. Setting `False` will leave provided data_loader unchanged. Technically this doesn't fit the assumptions made by privacy accounting mechanism, but it can be a good approximation when using Poisson sampling is unfeasible.
Major tech companies (Google, Apple, Meta) have deployed DP-SGD at scale, and popular libraries like Opacus and TensorFlow Privacy have made it accessible to any practitioner. Yet as the description above shows, the library that millions of practitioners use to train private models explicitly acknowledges that its default mode "doesn't fit the assumptions" of its own accountant — and suggests treating it as a "good enough approximation" anyway. For a long time, the field broadly agreed: the gap was probably small enough to ignore. This leaves practitioners in an uncomfortable position — the same model, trained with the same pipeline, can end up carrying a fundamentally incorrect privacy guarantee without anyone realizing it. So how bad can it actually get?
Recent theoretical work [2] showed that the lower bound on Shuffle-WOR's privacy cost can be up to 10× larger than what a Poisson accountant reports under the same parameters. Empirical auditing [3] confirmed this is not just a mathematical artifact — in practice, audited $\varepsilon$ values have come in 4–10× higher than the advertised Poisson guarantee. A model certified at $\varepsilon = 0.1$ was found to leak at the level of $\varepsilon_\text{emp} \approx 1.0$. That is not a rounding error. That is an order-of-magnitude gap in your privacy bill.
This is the problem we want to take seriously. Rather than accepting "good enough" as an answer, our goal is to put all three sampling strategies — Poisson, Shuffle-WOR, and a newer alternative called Balls-and-Bins — on equal footing and actually measure the differences. Balls-and-Bins, introduced by Chua et al. [4] and given tight formal bounds by Feldman & Shenfeld [7], offers a principled middle path between the theoretical cleanliness of Poisson and the practical convenience of Shuffle-WOR.
Our goal is to make this three-way comparison concrete: understand its theoretical roots, reproduce the accounting bounds for all three methods, explore two symmetry-based amplification strategies, and make the whole picture accessible to engineers who want to audit their own pipelines.
Our Research Question
Given that batch samplers matter so much, here is what we set out to answer:
How do Poisson, Shuffle-WOR, and Balls-and-Bins compare — on a simple baseline task, and on a rotationally invariant version of that task (using data augmentation vs. an invariant architecture) — when held to the same privacy guarantees and parameters?
To answer this, we need to (a) put all three samplers on equal footing with a unified measurement framework, and (b) re-derive the noise requirements for each under the same $(\varepsilon, \delta)$ target. Let's start by understanding what each sampler actually does.
The Three Ways to Pick a Batch
Let’s first look at the three main ways DP-SGD can form a batch.
Sampler 1: Poisson Subsampling
In Poisson subsampling, every data point independently flips a biased coin at each training step. With probability $q = B/n$ (where $B$ is the expected batch size and $n$ is the dataset size), a record is included in the current batch; otherwise, it is skipped. Because records are sampled independently, different steps produce batches of varying sizes, and a given record can appear in zero, one, or many batches in a single epoch.
Pros: Strong privacy amplification theorem — tight upper bounds are tractable with tools like the Privacy Loss Distribution (PLD) accountant [1].
Cons: Requires random access to the full dataset at every step; produces variable-length batches that are slow on GPU pipelines [3].
Algorithm: Poisson Sampler
At each step t:
For each data point i = 1, ..., n:
Include i in S_t with probability q, independently
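The pseudocode above is one line per step in Python. A minimal sketch using the standard library; `poisson_batches` is a hypothetical name of our choosing:

```python
import random

def poisson_batches(n, q, num_steps, seed=0):
    """Each record independently joins each batch with probability q; batch sizes vary."""
    rng = random.Random(seed)
    return [[i for i in range(n) if rng.random() < q] for _ in range(num_steps)]
```

Note that a record may land in several batches, or in none, within a single epoch.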
💡 Live Demo: Poisson
Poisson Subsampling q = 0.2
Each of the 10 balls enters each of the 5 batches independently with probability q = 0.2. A ball may appear in multiple batches — or none.

Sampler 2: Shuffle Without Replacement (WOR)
Shuffling is what virtually every non-private training loop already does: randomly permute the full dataset once per epoch, then slice it into fixed-size batches and iterate through them in order. Because the permutation is done without replacement, every record appears in exactly one batch per epoch — no more, no less.
Pros: Fast, memory-friendly, and drops directly into existing non-private codebases. Also known to converge faster than Poisson-based SGD [6].
Cons: Weaker privacy guarantee than Poisson; only a lower bound (not a tight upper bound) on $\varepsilon$ is currently known [3].
Algorithm: Shuffle-WOR Sampler
At the start of each epoch:
Draw a random permutation π of {1, ..., n}
Assign records π((t-1)·B + 1), ..., π(t·B) to batch S_t
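In code, this is exactly what every standard data loader does under the hood. A minimal stdlib sketch (`shuffle_wor_batches` is our own name):

```python
import random

def shuffle_wor_batches(n, batch_size, seed=0):
    """Permute the dataset once per epoch, then slice into fixed-size batches."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    return [perm[t * batch_size:(t + 1) * batch_size] for t in range(n // batch_size)]
```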
💡 Live Demo: Shuffle-WOR
Shuffle Without Replacement 2 per batch
Balls are shuffled once then sliced into fixed batches of 2. Every ball appears exactly once, but positions are correlated — knowing one ball's batch reveals where others are not.

Sampler 3: Balls-and-Bins (BB)
Balls-and-Bins is the newest of the three. Think of each data point as a ball, and each training step as a bin. At the start of every epoch, each ball is independently and uniformly tossed into one of the $T$ bins — meaning each record is assigned to exactly one training step, chosen uniformly at random. The key word is independently: unlike shuffle, knowing where record $i$ landed tells you nothing about where record $j$ landed.
Algorithm: Balls-and-Bins Sampler
At the start of each epoch:
Initialize all S_t = ∅
For each data point i = 1, ..., n:
Draw t uniformly at random from {1, ..., T}
Add i to S_t
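The assignment above is as short as the other two in code. A minimal stdlib sketch (`balls_and_bins_batches` is a name we chose for illustration):

```python
import random

def balls_and_bins_batches(n, num_steps, seed=0):
    """Each record independently picks exactly one of the T batches, uniformly at random."""
    rng = random.Random(seed)
    batches = [[] for _ in range(num_steps)]
    for i in range(n):
        batches[rng.randrange(num_steps)].append(i)
    return batches
```

Like shuffle, every record appears exactly once per epoch; like Poisson, the placements are mutually independent and batch sizes vary.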
Pros: Every record appears exactly once per epoch (like shuffle), so the training loop looks the same. Privacy amplification is comparable to — or better than — Poisson in practical regimes [4], [7].
Cons: Batch sizes still vary (like Poisson), so it requires slight infrastructure changes compared to shuffle.
💡 Live Demo: Balls-and-Bins
Balls-and-Bins Sampling ✦ Best of both
Each ball independently picks one of the 5 batches uniformly at random. Every ball appears exactly once and placements are mutually independent — combining the strengths of Poisson and Shuffle.

Why This Distinction Matters for Privacy
Now that the mechanics are clear, the privacy consequence follows naturally. The key difference is not just how records are assigned to batches, but what that assignment can reveal to an adversary.
Under Poisson, each step is independent. A canary may appear many times, once, or not at all, so observing many batches still does not let the adversary pin down where the canary was. Under Shuffle-WOR, however, the canary is guaranteed to appear exactly once per epoch. That means once the other batches look normal, the remaining unexplained batch becomes highly suspicious. In the words of the 2024 analysis [2], under shuffling the non-differing examples can leak information about the location of the differing example, while this is not the case under Poisson.
So even with the same noise and the same nominal training setup, the sampler alone can change the privacy cost. This is why using shuffling in training but Poisson in accounting can be misleading.
Comparing on Equal Footing: Zero-Out Adjacency
Before we can meaningfully compare the privacy cost of all three samplers, we need to make sure we are measuring privacy the same way for all three. This is trickier than it sounds, because different schemes naturally lend themselves to different adjacency definitions — the formal notion of what it means for two datasets to be "neighbors."
Three Notions of Adjacency
Add/remove adjacency says two datasets are neighbors if one can be obtained from the other by adding or removing a single record. This fits Poisson naturally (batch sizes vary anyway, so it makes sense to ask "is this record in the dataset or not?"), but the two datasets have different sizes — which breaks fixed-size-batch schemes like shuffle.
Substitution (edit) adjacency says two datasets are neighbors if one record is replaced by a different record. Both datasets have the same size, which fits shuffle, but it is not directly comparable to add/remove: a substitution is effectively two add/remove operations in a row, requiring roughly twice the noise — making comparisons unfair [5].
Zero-out adjacency bridges the gap. One dataset $D$ contains a real record $x_i$; its neighbor $D'$ is identical except that $x_i$ is replaced by a special "null" record $\perp$ whose gradient contribution is always exactly zero:
$$\nabla \ell(\perp; \theta) \equiv 0 \quad \text{for all } \theta$$
Both datasets have the same size, and the effect of $\perp$ is semantically equivalent to the record simply not being there. Crucially, DP-SGD under Poisson with zero-out adjacency is equivalent to DP-SGD with add/remove adjacency, so we lose nothing by using it. And because zero-out keeps dataset sizes constant, it applies cleanly to shuffle and BB as well.
Zero-out adjacency gives us a single, consistent measuring stick across all three sampling schemes — and it is the adjacency notion we use throughout our experiments.
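To see why the zero-out neighbor is easy to analyze, here is a toy sketch (the function name is ours) of the clipped-gradient sum, with $\perp$ modeled as `None`. Because $\perp$ contributes exactly zero, the sums on two zero-out neighbors can differ by at most the clipping norm $C$ — the sensitivity the Gaussian noise is calibrated to:

```python
def summed_clipped(per_record_grads, clip_norm):
    """Sum of clipped per-record gradients; the null record (None) contributes zero."""
    dim = len(next(g for g in per_record_grads if g is not None))
    total = [0.0] * dim
    for g in per_record_grads:
        if g is None:  # the zero-out record ⊥: gradient identically zero
            continue
        norm = sum(v * v for v in g) ** 0.5
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j, v in enumerate(g):
            total[j] += v * scale
    return total
```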
Our Accounting Tool: Privacy Loss Distributions (PLDs)
Going Further: Two Ways to Exploit Rotational Symmetry
Strategy 1: Data Augmentation
Strategy 2: Invariant Architecture
The Accounting Trade-off
Experimental Setup
Results
Limitations and Future Work
Summary and Takeaways
Further Reading & References
If this post piqued your interest, here are some good entry points into the literature:
[1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, "Deep Learning with Differential Privacy," in Proc. ACM CCS, 2016. arXiv:1607.00133
[2] L. Chua, B. Ghazi, P. Kamath, R. Kumar, P. Manurangsi, A. Sinha, and C. Zhang, "How Private are DP-SGD Implementations?" arXiv:2403.17673, 2024.
[3] M. S. M. S. Annamalai, B. Balle, J. Hayes, and E. De Cristofaro, "To Shuffle or Not to Shuffle: Auditing DP-SGD with Shuffling," in Proc. NDSS, 2026. arXiv:2411.10614
[4] P. Kairouz et al., "Practical and Private (Deep) Learning without Sampling or Shuffling," in Proc. ICML, 2021. arXiv:2103.00039
[5] C. J. Lebeda, M. Regehr, G. Kamath, and T. Steinke, "Avoiding Pitfalls for Privacy Accounting of Subsampled Mechanisms under Composition," arXiv:2405.20769, 2025. arXiv:2405.20769