Does It Matter How You Pick Your Batches?

Andy Cai, Yvonne Yifeng Peng
Published on 

Introduction

If you have ever trained a machine learning model and heard the phrase "we use Differential Privacy to protect user data," you might have assumed that the math behind it is airtight. In theory, it often is. In practice, however, a surprising number of real-world deployments quietly cut a corner that the theory never accounted for — and the impact on your privacy can be substantial.

This post is about one such corner: how batches of training data are sampled. Specifically, we compare three strategies — Poisson subsampling (what the theory assumes), shuffle-without-replacement (WOR, what most engineers actually implement), and Balls-and-Bins (BB, a recent proposal that tries to get the best of both worlds) — and explain why treating these as interchangeable is a mistake that can silently inflate your real privacy leakage by several times the advertised value.

We also go one step further. Once we have all three samplers on equal footing, we ask: can we squeeze out even more utility without paying extra in privacy? We look at two symmetry-based strategies — data augmentation (creating extra training examples by rotating images) and invariant architecture (training a neural network that treats rotated inputs as equivalent by design) — and compare how all three samplers hold up under each strategy, all held to the same privacy guarantees.

The road map for this post: we start with a quick refresher on differential privacy and DP-SGD, then explain why the gap between Poisson and Shuffle-WOR is a real problem that the field has been too quick to dismiss. From there, we introduce all three sampling schemes side by side, walk through the rotational symmetry strategies we layer on top, explain how we use Privacy Loss Distributions (PLDs) to account for privacy on equal footing, and finally compare results across all three samplers under the same fixed $(\varepsilon, \delta)$ target.


A Quick Refresher: Differential Privacy and DP-SGD

This post assumes you are comfortable with the basics of differential privacy. If not, the PyTorch blog series is a great starting point: Differential Privacy Series Part 1 — DP-SGD Algorithm Explained.

Differential Privacy

Differential Privacy (DP) is a mathematical guarantee that an algorithm's output is nearly indistinguishable whether or not any single individual's data is included. Formally, a mechanism $M$ satisfies $(\varepsilon, \delta)$-DP if for any two neighboring datasets $x$ and $x'$ differing on one record, and any set of outputs $E$:

$$\Pr[M(x) \in E] \leq e^{\varepsilon} \cdot \Pr[M(x') \in E] + \delta$$

The parameters $\varepsilon$ (epsilon) and $\delta$ (delta) quantify the privacy loss — smaller is better. Intuitively: even the best possible classifier trying to distinguish outputs from $M(x)$ vs. $M(x')$ must make a trade-off between its true positive rate (TPR) and false positive rate (FPR) bounded by $\text{FPR} \geq \frac{\text{TPR} - \delta}{e^\varepsilon}$.

DP-SGD

DP-SGD [1] is the standard recipe for training neural networks with formal privacy guarantees. At each step it draws a random batch, clips every per-sample gradient to bound its influence, and adds calibrated Gaussian noise before updating the model. The core parameters are:

| Symbol | Role |
| --- | --- |
| $C$ | Gradient clipping threshold — caps any one record's influence |
| $\sigma$ | Noise multiplier — scales the injected Gaussian noise |
| $q = B/n$ | Sampling rate — fraction of the dataset seen per step ($B$ is the expected batch size) |
| $T$ | Total number of training steps |
| $(\varepsilon, \delta)$ | Output privacy cost, computed after all $T$ steps by a privacy accountant |
| $\mathcal{B}$ | Batch sampler ← our focus |

At each training step $t$:

  1. Sample a mini-batch $S_t$ via sampler $\mathcal{B}$

  2. Compute per-sample gradients $\mathbf{g}_t(x_i)$

  3. Clip & Noise

$$ \tilde{\mathbf{g}}_t \leftarrow \frac{1}{B}\left(\sum_i \bar{\mathbf{g}}_t(x_i) + \mathcal{N}(0, \sigma^2 C^2 \mathbf{I})\right) $$

  4. Descend

$$ \theta_{t+1} \leftarrow \theta_t - \eta_t \tilde{\mathbf{g}}_t $$
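The four steps can be put together in a toy NumPy sketch. The function and variable names here are ours, and this is only an illustration of the arithmetic; real libraries such as Opacus compute per-sample gradients inside the autograd engine.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(theta, per_sample_grads, C, sigma, lr):
    """One DP-SGD update from a batch of per-sample gradients (toy sketch)."""
    B = per_sample_grads.shape[0]
    # 3a. Clip: rescale each gradient so its L2 norm is at most C.
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clipped = per_sample_grads * np.minimum(1.0, C / norms)
    # 3b. Noise: add Gaussian noise with std sigma * C to the summed gradients.
    noisy_sum = clipped.sum(axis=0) + rng.normal(0.0, sigma * C, size=theta.shape)
    # 4. Descend on the noisy average.
    return theta - lr * noisy_sum / B

theta = np.zeros(3)
grads = rng.normal(size=(32, 3))  # stand-in for per-sample gradients
theta = dp_sgd_step(theta, grads, C=1.0, sigma=0.5, lr=0.05)
```

Note that clipping happens per sample, before the sum; clipping the averaged gradient instead would not bound any single record's influence.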

Notice that Sample and Clip & Noise are the only two steps that actually create privacy protection. Everything else — gradient computation, the model update — is either pure math or post-processing of an already-noisy signal. Post-processing never weakens a DP guarantee, so it has no bearing on $(\varepsilon, \delta)$ at all. In practice the noise multiplier is often fixed by utility requirements, which leaves the batch sampler as the one remaining degree of freedom, sitting quietly at the top of the loop: how do you pick the batch in the first place?


Problems, Motivation, and Research Focus

Privacy accountants assume Poisson subsampling — the version of DP-SGD where each record independently flips a coin at every step. When used correctly, it comes with a mathematical certificate. Train at noise level $\sigma$, run the privacy accountant, and out comes an $(\varepsilon, \delta)$ bound you can point to. But virtually every real training loop uses shuffle-without-replacement instead, because it is faster, GPU-friendly, and already built into every standard data loader. And when you ask the libraries about this, the answer is telling. Here is what Opacus says in its own documentation:

poisson_sampling (bool) – True if you want to use standard sampling required for DP guarantees. Setting False will leave provided data_loader unchanged. Technically this doesn't fit the assumptions made by privacy accounting mechanism, but it can be a good approximation when using Poisson sampling is unfeasible.

Major tech companies (Google, Apple, Meta) have deployed DP-SGD at scale, and popular libraries like Opacus and TensorFlow Privacy have made it accessible to any practitioner. Given the description above, the library that millions of practitioners use to train private models explicitly acknowledges that its default mode "doesn't fit the assumptions" of its own accountant — and suggests treating it as a "good approximation" anyway. For a long time, the field broadly agreed: the gap was probably small enough to ignore. This leaves practitioners in an uncomfortable position — the same model, trained with the same pipeline, can end up carrying a fundamentally incorrect privacy guarantee without anyone realizing it. So how bad can it actually get?

Recent theoretical work [2] showed that the lower bound on Shuffle-WOR's privacy cost can be up to 10× larger than what a Poisson accountant reports under the same parameters. Empirical auditing [3] confirmed this is not just a mathematical artifact — in practice, audited $\varepsilon$ values have come in 4–10× higher than the advertised Poisson guarantee. A model certified at $\varepsilon = 0.1$ was found to leak at the level of $\varepsilon_\text{emp} \approx 1.0$. That is not a rounding error. That is an order-of-magnitude gap in your privacy bill.

This is the problem we want to take seriously. Rather than accepting "good enough" as an answer, our goal is to put all three sampling strategies — Poisson, Shuffle-WOR, and a newer alternative called Balls-and-Bins — on equal footing and actually measure the differences. Balls-and-Bins, introduced by Chua et al. [4] and given tight formal bounds by Feldman & Shenfeld [7], offers a principled middle path between the theoretical cleanliness of Poisson and the practical convenience of Shuffle-WOR.

Our goal is to make this three-way comparison concrete: understand its theoretical roots, reproduce the accounting bounds for all three methods, explore two symmetry-based amplification strategies, and make the whole picture accessible to engineers who want to check their own pipelines.


Our Research Question

Given that batch samplers matter so much, here is what we set out to answer:

How do Poisson, Shuffle-WOR, and Balls-and-Bins compare — on a simple baseline task, and on a rotationally invariant version of that task (using data augmentation vs. an invariant architecture) — when held to the same privacy guarantees and parameters?

To answer this, we need to (a) put all three samplers on equal footing with a unified measurement framework, and (b) re-derive the noise requirements for each under the same $(\varepsilon, \delta)$ target. Let's start by understanding what each sampler actually does.


The Three Ways to Pick a Batch


Sampler 1: Poisson Subsampling

In Poisson subsampling, every data point independently flips a biased coin at each training step. With probability $q = B/n$ (where $B$ is the expected batch size and $n$ is the dataset size), a record is included in the current batch; otherwise, it is skipped. Because records are sampled independently, different steps produce batches of varying sizes, and a given record can appear in zero, one, or many batches in a single epoch.

Pros: Strong privacy amplification theorem — tight upper bounds are tractable with tools like the PLD accountant [1].

Cons: Requires random access to the full dataset at every step; produces variable-length batches that are slow on GPU pipelines [3].

Algorithm: Poisson Sampler
At each step t:
  Initialize S_t = ∅
  For each data point i = 1, ..., n:
    With probability q, add i to S_t (independently)
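The pseudocode above translates to a few lines of NumPy. The helper name and `rng` are ours, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_batch(n, q):
    # Each record flips its own coin: included with probability q,
    # independently of all others. Batch sizes therefore fluctuate
    # around the expected size q * n.
    return np.flatnonzero(rng.random(n) < q)

# Five draws from the same dataset: the expected size is 2000,
# but every batch has a different length.
sizes = [len(poisson_batch(10_000, 0.2)) for _ in range(5)]
```

This variable length is exactly what makes Poisson awkward for fixed-shape GPU pipelines.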

💡 Live Demo - Poisson (interactive, $q = 0.2$): each of the 10 balls enters each of the 5 batches independently with probability $q = 0.2$. A ball may appear in multiple batches — or none.

Sampler 2: Shuffle Without Replacement (WOR)

Shuffling is what virtually every non-private training loop already does: randomly permute the full dataset once per epoch, then slice it into fixed-size batches and iterate through them in order. Because the permutation is done without replacement, every record appears in exactly one batch per epoch — no more, no less.

Pros: Compared with Poisson subsampling, shuffle without replacement is more implementation-friendly because it uses fixed-size batches and fits existing training pipelines better [3][6].

Cons: Weaker privacy guarantee than Poisson; only a lower bound (not a tight upper bound) on $\varepsilon$ is currently known [3].

Algorithm: Shuffle-WOR Sampler
At the start of each epoch:
  Draw a random permutation π of {1, ..., n}
  For t = 1, ..., n/B:
    Let S_t = { π((t-1)·B + 1), ..., π(t·B) }
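In NumPy this is a one-permutation slice, the same thing every standard data loader does under the hood (helper name ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffle_wor_batches(n, B):
    # One random permutation per epoch, sliced into fixed-size batches of B.
    # Every index appears in exactly one batch.
    perm = rng.permutation(n)
    return [perm[t * B:(t + 1) * B] for t in range(n // B)]

batches = shuffle_wor_batches(n=128, B=32)
```

The fixed batch shape is the practical appeal; the "exactly once" structure is the privacy liability discussed below.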

💡 Live Demo: Shuffle-WOR (interactive, 2 per batch): balls are shuffled once then sliced into fixed batches of 2. Every ball appears exactly once, but positions are correlated — knowing one ball's batch reveals where others are not.

Sampler 3: Balls-and-Bins (BB)

Balls-and-Bins is the newest of the three. Think of each data point as a ball, and each training step as a bin. At the start of every epoch, each ball is independently and uniformly tossed into one of the $T$ bins — meaning each record is assigned to exactly one training step, chosen uniformly at random. The key word is independently: unlike shuffle, knowing where record $i$ landed tells you nothing about where record $j$ landed.

Algorithm: Balls-and-Bins Sampler
At the start of each epoch:
  Initialize all S_t = ∅
  For each data point i = 1, ..., n:
    Draw t uniformly at random from {1, ..., T}
    Add i to S_t
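A NumPy sketch of the same idea (helper name ours). Note the single independent draw per record, in contrast to Poisson's per-step coin flips:

```python
import numpy as np

rng = np.random.default_rng(0)

def balls_and_bins_batches(n, T):
    # Each record independently picks one of the T steps uniformly at random.
    # Every record lands in exactly one batch, but batch sizes vary.
    assignment = rng.integers(0, T, size=n)
    return [np.flatnonzero(assignment == t) for t in range(T)]

batches = balls_and_bins_batches(n=128, T=4)  # sizes fluctuate around 32
```

Because the assignments are mutually independent, observing where other records landed reveals nothing about the canary's batch, which is precisely the property shuffle lacks.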

Pros: Every record appears exactly once per epoch (like shuffle), so the training loop looks the same. Privacy amplification is comparable to — or better than — Poisson in practical regimes [4], [7].

Cons: Batch sizes still vary (like Poisson), so it requires slight infrastructure changes compared to shuffle.

💡 Live Demo: Balls-and-Bins (interactive): each ball independently picks one of the 5 batches uniformly at random. Every ball appears exactly once and placements are mutually independent — combining the strengths of Poisson and Shuffle.

Why This Distinction Matters for Privacy

Now that the mechanics are clear, the privacy consequence follows naturally. The key difference is not just how records are assigned to batches, but what that assignment can reveal to an adversary.

Under Poisson, each step is independent. A canary may appear many times, once, or not at all, so observing many batches still does not let the adversary pin down where the canary was. Under Shuffle / WOR, however, the canary is guaranteed to appear exactly once per epoch. That means once the other batches look normal, the remaining unexplained batch becomes highly suspicious. In short, under shuffling the non-differing examples can leak information about the location of the differing example, while this is not the case under Poisson.

So even with the same noise and the same nominal training setup, the sampler alone can change the privacy cost. This is why using shuffling in training but Poisson in accounting can be misleading.


Comparing on Equal Footing: Zero-Out Adjacency

Before we can meaningfully compare the privacy cost of all three samplers, we need to make sure we are measuring privacy the same way for all three. This is trickier than it sounds, because different schemes naturally lend themselves to different adjacency definitions — the formal notion of what it means for two datasets to be "neighbors."

Three Notions of Adjacency

(Diagram: the three adjacency notions — add/remove, substitution, and zero-out.)

Add/remove adjacency says two datasets are neighbors if one can be obtained from the other by adding or removing a single record. This is natural for Poisson (batch sizes vary anyway, so it makes sense to ask "is this record in the dataset or not?"), but the two datasets have different sizes, which breaks fixed-size-batch schemes like shuffle.

Substitution (edit) adjacency says two datasets are neighbors if one record is replaced by a different record. Both datasets have the same size, which fits shuffle, but it is not directly comparable to add/remove: a substitution is effectively two add/remove operations in a row, requiring roughly twice the noise — making comparisons unfair [5].

Zero-out adjacency bridges the gap. One dataset $D$ contains a real record $x_i$; its neighbor $D'$ is identical except that $x_i$ is replaced by a special "null" record $\perp$ whose gradient contribution is always exactly zero:

$$\nabla \ell(\perp; \theta) \equiv 0 \quad \text{for all } \theta$$

Both datasets have the same size, and the effect of $\perp$ is semantically equivalent to the record simply not being there. Crucially, DP-SGD under Poisson with zero-out adjacency is equivalent to DP-SGD with add/remove adjacency, so we lose nothing by using it. And because zero-out keeps dataset sizes constant, it applies cleanly to shuffle and BB as well.

Zero-out adjacency gives us a single, consistent measuring stick across all three sampling schemes — and it is the adjacency notion we use throughout our experiments.


Experimental Setup

Baseline Task

We use MNIST classification as a simple baseline: 28×28 grayscale images of handwritten digits (10 classes). We use a small subset of 20 000 samples (16 000 train, 2 000 test, 2 000 val).

For each batch sampling method, we set the expected batch size to be 32; this allows fair comparison between methods even if they have variable batch sizes.

For our model, we use a shallow CNN with two convolutional layers. We train this for 10 epochs at a learning rate of 0.05 using the unmodified SGD optimizer. We run a small hyperparameter search to find the optimal value of gradient norm clipping in the set $\lbrace 1, 1.25, 1.5, 1.75, 2 \rbrace$.

Rotationally Invariant Task

To introduce rotational invariance, we randomly rotate input images by multiples of 90$^\circ$.

We consider two approaches to tackling the problem:

Data Augmentation: Use the same model, but augment the training set with copies of each image rotated at multiples of 90$^\circ$

Group Invariant CNN: Use C4-invariant steerable CNN (implemented using escnn [10]) to incorporate the rotational invariance of the problem directly into the model
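The data-augmentation step can be sketched with `np.rot90` (the helper name is ours; this is how one might build the 4× augmented training set):

```python
import numpy as np

def augment_with_rotations(images):
    # images: array of shape (N, H, W).
    # Returns shape (4N, H, W): each image rotated by 0, 90, 180,
    # and 270 degrees about the spatial axes.
    return np.concatenate([np.rot90(images, k, axes=(1, 2)) for k in range(4)])

batch = np.arange(16 * 28 * 28, dtype=float).reshape(16, 28, 28)
augmented = augment_with_rotations(batch)  # shape (64, 28, 28)
```

The invariant-architecture route instead bakes the C4 symmetry into the network (via escnn's steerable layers), so no extra copies of the data are needed.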

Fair(ish) Comparison

Our goal is to compare sampling methods under the same privacy guarantees. In particular, we use a target of $\varepsilon = 5$, $\delta = 10^{-6}$.

We use the following methods for computing bounds based on the noise scale value for each sampler:

Unaugmented Dataset

  • Poisson: Existing accountants (e.g. from Opacus [11]) can compute tight upper bounds
  • Balls-and-Bins: Feldman & Shenfeld [7] provide a method for finding tight upper bounds
  • Shuffle-WOR: Chua et al. [2] provide a method for finding lower bounds.

Note: Technically our comparison isn't fair, since we could be underestimating the amount of noise needed for Shuffle-WOR; however, Chua et al. conjecture that their lower bounds are in fact tight upper bounds, and empirical auditing has thus far agreed.

Augmented Dataset:

  • Poisson: Choquette-Choo et al. [9] provide a method for finding tight upper bounds
  • Balls-and-Bins: Feldman & Shenfeld [7] provide a method for finding nearly tight upper bounds of k-of-t random allocation.

Note: These bounds are not technically correct for our scenario, as they assume no repeat sampling within a single batch. However, our dataset is large enough that repeat samplings are relatively rare, so we feel it's a decent approximation.

  • Shuffle-WOR: No research was found, so we derived our own lower bounds.

(Interactive $\varepsilon$ vs. $\sigma$ plot, with adjustable $N$ and $k$.)
Ultimately, the noise scales that produce the desired bounds are:

| | Poisson | Shuffle-WOR | BB |
| --- | --- | --- | --- |
| Baseline $\sigma$ | 0.5768 | 1.0881 | 0.5635 |
| Augmented $\sigma$ | 0.6858 | 1.2112 | 0.6279 |
  • Balls-and-Bins always has the least required noise, followed by Poisson and Shuffle-WOR.
  • Training on the augmented dataset requires more noise, which makes sense: the extra noise offsets the fact that the mechanism sees each sample more often.

Theoretical Results: Novel Lower Bounds for Shuffle-WOR

To understand how we proved our bounds on Shuffle-WOR, let's briefly look at how the original bounds were derived by Chua et al.

Dominating Pairs

We typically abstract away the complicated mechanics of the DP-SGD scheme (architecture, optimizer, etc.) as an Adaptive Batch Linear Queries (ABLQ) mechanism [2], leaving only the batch sampler and noise scale as parameters. But where do we go from here?

The key idea is to find a worst-case scenario. We represent this as a pair of distributions $P$ and $Q$, known as a dominating pair. The defining property of these distributions is that it is always easier to distinguish $P$ from $Q$ than to distinguish the mechanism's outputs on any two neighbouring datasets; in other words:

$$\delta_{P, Q}{(\varepsilon)} \ge \delta_{M(x), M(x')}{(\varepsilon)}$$

From here on, we'll represent this more succinctly as:

$$(P, Q) \succcurlyeq M$$

So to get upper bounds on our privacy guarantees, we can simply work with this pair instead, which is usually much simpler.

If a pair is tightly dominating, then this worst-case scenario is actually realized in some pair of neighbouring datasets, and thus no possible improvement can be made to the produced upper bounds.

In addition to being easier to prove bounds on, dominating pairs also make composition easy: a dominating pair for the composition of two mechanisms $M_1$ and $M_2$ is simply the product of their dominating pairs:

$$(P_1 \times P_2, Q_1 \times Q_2) \succcurlyeq M_1 \cdot M_2$$

The Shuffle-WOR Case

Unfortunately, it's not always easy to find a dominating pair. For the ABLQ mechanism with the Shuffle-WOR batch sampler, it is still an open question whether a tightly dominating pair even exists.

However, Chua et al. seek only to find a decent lower bound, so they look at a specific instance of the adaptive query method $A$ and pair of neighbouring datasets $(\mathbf{x}, \mathbf{x'})$. In particular, the query method is unadaptive and simply returns the sign of the input value: $$\psi_t(x) = \text{sign}(x)$$

The neighbouring datasets are:

$$\mathbf{x} = (x_1 = -1, x_2 = -1, \ldots, x_n = 1)$$ $$\mathbf{x'} = (x_1 = -1, x_2 = -1, \ldots, x_n = \perp)$$

Using the definition of $ABLQ$, we find that the output distributions on $x$ and $x'$ are:

$$ABLQ_S(x) = \sum_{t=1}^T \frac{1}{T} \cdot \mathcal{N}(-b \cdot \mathbf{1} + 2e_t, \sigma^2 I)$$ $$ABLQ_S(x') = \sum_{t=1}^T \frac{1}{T} \cdot \mathcal{N}(-b \cdot \mathbf{1} + e_t, \sigma^2 I)$$

where $e_t$ is the $t$-th basis vector, and $b$ is the batch size.

Intuitively, this says that in the shuffle method, the canary / zero out variable occurs exactly once at a random batch $t$ with uniform probability. The baseline output of a batch if all samples are negative is $-b$. In the batch where the canary occurs, the difference in output is 2 (replacing a -1 with 1), and where the zero-out term occurs, the difference is 1 (replacing a -1 with 0).

Since a constant shift doesn't change the privacy guarantees, we can simplify this pair to:

$$P = \sum_{t=1}^T \frac{1}{T} \cdot \mathcal{N}(2e_t, \sigma^2 I)$$ $$Q = \sum_{t=1}^T \frac{1}{T} \cdot \mathcal{N}(e_t, \sigma^2 I)$$

Note: Intuitively, this mechanism and pair of datasets feel like the worst case scenario; the output of the query method on the differing sample is as far from the other samples as possible. This motivated Chua et al. to conjecture this to be a tightly dominating pair.

Our Modification

Since we augment our dataset to have $k$ replicas, our respective neighbouring datasets are:

$$\mathbf{x} = (x_1 = -1, x_2 = -1, \ldots, x_{n-k + 1} = 1, \ldots, x_n = 1)$$ $$\mathbf{x'} = (x_1 = -1, x_2 = -1, \ldots, x_{n-k + 1} = \perp, \ldots, x_n = \perp)$$

By a similar logic, our lower bound / conjectured tightly-dominating pair is:

$$P_k = \sum_{(t_1, t_2, \ldots, t_k) \in [T]^k} \frac{1}{T^k} \cdot \mathcal{N}\left(2\sum_{i=1}^k e_{t_i}, \sigma^2 I\right)$$ $$Q_k = \sum_{(t_1, t_2, \ldots, t_k) \in [T]^k} \frac{1}{T^k} \cdot \mathcal{N}\left(\sum_{i=1}^k e_{t_i}, \sigma^2 I\right)$$

Privacy Loss Distributions (PLDs)

Numerically working with dominating pairs can still be difficult, since they are potentially arbitrarily complex multivariate distributions. Most modern privacy accountants make an additional simplification by looking at the distribution of a single, univariate "privacy loss variable":

$$L = \log\frac{P(x)}{Q(x)}, \qquad x \sim P$$

This variable preserves the desirable properties of dominating pairs, like the ability to compute $(\varepsilon, \delta)$ guarantees:

$$\delta(\varepsilon) = \mathbb{E}[\max(0, 1 - e^{\varepsilon - L})]$$

as well as easy composition, since the privacy loss variable of a composition is simply the sum of the privacy loss variables of the components:

$$L_{total} = L_1 + L_2$$

Since it's a univariate random variable, it is also much easier to work with numerically, especially for operations like composition via FFT.
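As a toy illustration of how $(\varepsilon, \delta)$ guarantees drop out of a discrete pair, the formula above is only a few lines of NumPy (the distribution values here are made up, not from any of our samplers):

```python
import numpy as np

# A toy discrete dominating pair over four events (hypothetical numbers).
P = np.array([0.5, 0.3, 0.15, 0.05])
Q = np.array([0.3, 0.3, 0.2, 0.2])

# Privacy loss value of each event under x ~ P.
L = np.log(P / Q)

def delta(eps):
    # delta(eps) = E_{x~P}[max(0, 1 - e^{eps - L})]
    return float(np.sum(P * np.maximum(0.0, 1.0 - np.exp(eps - L))))
```

As expected, `delta` is nonincreasing in `eps`: the looser the $\varepsilon$ you are willing to claim, the less slack $\delta$ you need.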

Pairs to PLDs

However, we once again run into a problem; it is not always easy to express the PLD in a nice way. For our particular pairs, finding the PLD naively requires numerical integration over $T$ variables, which quickly becomes infeasible for even small values of $T$.

Since we are only looking for lower bounds, we can use a simplification. Instead of working with continuous distributions $P, Q$, we turn them into discrete distributions over a finite set of disjoint events $\mathcal{E}$. By the "DP post-processing theorem" (Dwork and Roth [12]), we know that:

$$(P, Q) \succcurlyeq (P^\mathcal{E}, Q^\mathcal{E})$$

so our discrete distributions will give a lower bound approximation.

Note: For some intuition, imagine that discretization loses some detail about the shape of each distribution, akin to pixelating an image; naturally, the pixelated versions are harder to tell apart.

Chua et al. choose their events to be:

$$ \mathcal{E}_{i} = \left\lbrace w \in \mathbb{R}^{T} \mid C_i \lt \max_{t} w_{t} \le C_{i+1} \right\rbrace $$

where the sequence $C_i$ is chosen to be sufficiently dense.

This particular set of events is chosen since it is easy to compute $P(\mathcal{E}_i)$ and $Q(\mathcal{E}_i)$:

$$P(\mathcal{E}_i) = \Phi \left(\frac{C_{i + 1} - 2}{\sigma} \right) \cdot \Phi \left(\frac{C_{i + 1} }{\sigma}\right) ^ {T - 1} - \Phi \left(\frac{C_i - 2}{\sigma} \right) \cdot \Phi \left(\frac{C_i }{\sigma}\right) ^ {T - 1}$$

$$Q(\mathcal{E}_i) = \Phi \left(\frac{C_{i + 1} - 1}{\sigma} \right) \cdot \Phi \left(\frac{C_{i + 1} }{\sigma}\right) ^ {T - 1} - \Phi \left(\frac{C_i - 1}{\sigma} \right) \cdot \Phi \left(\frac{C_i }{\sigma}\right) ^ {T - 1}$$

where $\Phi$ denotes the standard normal CDF.
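These event probabilities, and the resulting lower bound on $\delta(\varepsilon)$, can be computed directly. The grid choice and helper names below are ours, and we hand-roll the normal CDF to avoid a SciPy dependency:

```python
import math
import numpy as np

# Standard normal CDF, vectorized.
Phi = np.vectorize(lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0))))

T, sigma = 10, 1.0
C = np.linspace(-10.0, 12.0, 4001)  # thresholds C_0 < C_1 < ... (our grid choice)

def max_cdf(c, shift):
    # Pr[max_t w_t <= c] when one coordinate is N(shift, sigma^2)
    # and the remaining T - 1 coordinates are N(0, sigma^2).
    return Phi((c - shift) / sigma) * Phi(c / sigma) ** (T - 1)

P = max_cdf(C[1:], 2.0) - max_cdf(C[:-1], 2.0)  # P(E_i)
Q = max_cdf(C[1:], 1.0) - max_cdf(C[:-1], 1.0)  # Q(E_i)

def delta_lower_bound(eps):
    # Hockey-stick divergence of the discretized pair; by the
    # post-processing argument this lower-bounds the continuous pair's delta(eps).
    return float(np.sum(np.maximum(0.0, P - math.exp(eps) * Q)))
```

In our pipeline the discretized pair is then handed to Google's dp_accounting library for multi-epoch composition rather than evaluated with this single-shot formula.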

We can also compute the probabilities of these events for our pairs $P_k$ and $Q_k$, though the expression is more complicated and will not be discussed here.

Finally, we represent the PLDs of these pairs of discrete distributions using Google's dp_accounting library, which provides methods for computing $(\varepsilon, \delta)$ guarantees as well as accurate N-fold compositions, allowing us to compute guarantees for multiple epochs.

Experimental Results

The test accuracies for each experiment are:

| Experiment | Poisson | Shuffle-WOR | Balls-and-Bins |
| --- | --- | --- | --- |
| Baseline | 0.8968 | 0.8676 | 0.8995 |
| Data Augmentation | 0.8415 | 0.7931 | 0.8539 |
| Invariant Model | 0.8834 | 0.8428 | 0.8850 |
  • Balls-and-Bins performs the best at all tasks, likely due to requiring the least amount of noise for the same guarantees.
  • For the rotationally invariant task, using an invariant model appears to be a better strategy than using data augmentation; the decrease in required noise appears to make up for seeing the data less.
  • As a minor note, Shuffle-WOR seems to perform decently despite requiring significantly more noise. This could potentially be explained by the fact that Shuffle-WOR is known to converge faster than Poisson on un-noised SGD [6]; intuitively, this is because there is less variance in the number of times the optimizer sees each sample.

Limitations and Future Work

Auditing

The original vision for this project was more focused on privacy auditing (running empirical membership-inference attacks to estimate actual leakage); this would have let us directly compare $\varepsilon_\text{emp}$ to theoretical bounds, the way Annamalai et al. [3] have done. However, in practice, auditing batch samplers requires training up to $10^6$ times, which is beyond our compute budget.

Theoretical gaps

Only lower bounds on Shuffle-WOR currently exist, and tight upper bounds remain an open problem ([2], [3], [8]). For the augmented BB case, the Feldman & Shenfeld [7] bounds do not precisely fit our scenario, and bounds that account for possible repeated sampling in a single batch remain to be derived.

Future directions

The most immediate direction is to simply compare on a larger, more realistic rotationally invariant task (e.g. satellite object detection, medical imaging). Rotated MNIST has many inherent flaws (e.g. some digits are truly indistinguishable once rotated), so a better problem might serve as a more useful comparison.

Research into low-compute auditing of batch samplers could also be interesting; this would allow us to test how tight our derived lower bounds for Shuffle-WOR k-repeated sampling are.

Summary and Takeaways

  • Do not mix guarantees. Most DP-SGD deployments use shuffling for efficiency but compute privacy budgets assuming Poisson; this can result in significant underreporting of the true privacy leakage.
  • Use the Balls-and-Bins sampler. In most situations, Balls-and-Bins has significant advantages in speed and training efficiency over Poisson sampling, and unlike Shuffle-WOR, it has provably tight upper bounds, making it safe to use.

Please find our experiment code here for your further interest.

References

[1] M. Abadi et al., ‘Deep Learning with Differential Privacy’, in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 308–318.

[2] L. Chua et al., ‘How Private are DP-SGD Implementations?’, arXiv [cs.LG]. 2024.

[3] M. S. M. S. Annamalai, B. Balle, J. Hayes, and E. D. Cristofaro, ‘To Shuffle or not to Shuffle: Auditing DP-SGD with Shuffling’, arXiv [cs.CR]. 2025.

[4] L. Chua et al., ‘Balls-and-Bins Sampling for DP-SGD’, arXiv [cs.LG]. 2025.

[5] C. J. Lebeda, M. Regehr, G. Kamath, and T. Steinke, ‘Avoiding Pitfalls for Privacy Accounting of Subsampled Mechanisms under Composition’, arXiv [cs.CR]. 2025.

[6] J. Z. HaoChen and S. Sra, ‘Random Shuffling Beats SGD after Finite Epochs’, arXiv [math.OC]. 2019.

[7] V. Feldman and M. Shenfeld, ‘Efficient privacy loss accounting for subsampling and random allocation’, arXiv [cs.LG]. 2026.

[8] L. Chua, B. Ghazi, P. Kamath, R. Kumar, P. Manurangsi, A. Sinha, and C. Zhang, "Scalable DP-SGD: Shuffling vs. Poisson Subsampling," NeurIPS, 2024.

[9] C. A. Choquette-Choo, A. Ganesh, T. Steinke, and A. Thakurta, "Privacy Amplification for Matrix Mechanisms," arXiv:2310.15526, 2023.

[10] M. Weiler and G. Cesa, "General E(2)-Equivariant Steerable CNNs," NeurIPS, 2019.

[11] A. Yousefpour et al., ‘Opacus: User-Friendly Differential Privacy Library in PyTorch’, arXiv preprint arXiv:2109. 12298, 2021.

[12] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211–407, 2014
