
Private Set Intersection for Training Data Auditing

Angela Sha, Sierra Wyllie
Published on 
Our proposed pipeline for machine learning training data auditing with fuzzy private set intersection.

Motivation

Machine learning (ML) is now both ubiquitous and valuable; the global ML market was worth over 40 billion CAD in 2024. The range of applications for ML also continues to grow, from simple predictive models to large language models and agentic systems. Much of this value derives from the data used to train these models, which is often taken without informed consent. This is a form of data piracy: the unauthorized acquisition, use, or exploitation of data for training models. There are many reasons why individuals or organizations may want to know whether their data has been leaked or pirated, including to seek compensation or legal action, to be forgotten (for ethics or privacy concerns), or to identify leaks in their information security.

In this blog post, we will explore this problem using variants of private set intersection (PSI). PSI allows two parties to identify the intersection of their datasets without either party learning anything else about the other's dataset. Our goal is to determine whether PSI is a feasible avenue for ML training dataset auditing.

Over the next few sections, we'll give an example of why this is an important problem, then get into the details of our methods. After that, we'll see whether these methods actually work. Finally, we'll wrap up with directions for future work for interested researchers.


I Have a Famous Dog

Let's say I run an Instagram account for my dog, and we've become super popular. But I've also started to notice that there are some AI-generated images of dogs that look very similar to my dog. Usually I charge a fee for someone to use my dog's likeness, so I want to know if my images were used to train the AI model.

Now, the best way for me to do this is by looking through the company's training dataset. But if the company allowed this, that would be a huge privacy violation for everyone else who has data in that training set.

So instead, we'll need a way to discover if my dog's images were in that database without actually inspecting the database. Ideally, we should also be able to do this without needing to trust the company or any third party.


The Training Data Problem

The problem is that the state-of-the-art computer vision models we might want to audit (e.g. DINOv2 [1] and Molmo [2]) have complex training data pipelines, spanning data collection and sampling through pre-processing steps such as transformations. Their training data comes from a variety of sources, which may include web scraping, externally curated repositories, internal proprietary data, or synthetic data. This means there are many different ways an individual's data could end up in a training dataset, so it would be a lot of work if we each had to monitor our data across all of these channels.

In addition to the multiple sources of data, the data itself is high-dimensional and noisy. To improve a model's performance and robustness, training data is typically pre-processed with image transformations the model may see at inference time, including visual corruptions such as noise, blur, cropping, and other digital transformations [3, 4].

Knowing what these transformations are can help us during auditing, since we then know exactly which versions of the image we're looking for. But there's a good chance the model owner doesn't even know all the transformations their data has undergone, due to the way it was collected.

For example, if they scraped an image from a social media site that compressed the original image, the company would be unaware that the image they hold is not the original version. That means that even if I applied their training data transformations to my dog images, our images would still not be identical.

Image 1: Image transformations on CIFAR dataset

All of these factors make tracing the source of and changes to the training data difficult. Training data auditing is the family of techniques that tries to answer the question: was my data used to train this model? The answer to this question has implications in privacy regulation, copyright enforcement, and for individuals looking to audit for their data.


How Can We Represent Images?

All images have some sort of representation, usually numerical pixel values. For our purposes, there are two properties we need to make life a little easier:

  1. Semantic usefulness. We need to discover if two images are similar (i.e. if one image is a transformed version of another). We can test that by seeing if the distance between their image representations is small. For that to hold, we need similar images to be close together, and dissimilar images to be far apart.

  2. Compression. As we'll see later, it's generally helpful if the image representation is much smaller than the image itself. This really helps with efficiency when comparing images.

The semantic usefulness requirement here complicates matters. Compression is relatively easy, and there are many hashing algorithms (cryptographic or otherwise) that could work. But if we compared conventional cryptographic hashes of an image and its transformation, we'd find that these representations are very far apart, since these algorithms were not built to map similar images to similar hashes.
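To see why, here's a minimal sketch using Python's standard hashlib; the byte strings stand in for the bytes of an image and its near-identical transformation:

```python
import hashlib

# Two nearly identical inputs: a single-character difference.
h1 = hashlib.sha256(b"my dog photo, version 1").hexdigest()
h2 = hashlib.sha256(b"my dog photo, version 2").hexdigest()

# XOR the 256-bit digests and count differing bits. The avalanche effect
# flips roughly half of them, so hash distance carries no similarity signal.
bit_diff = bin(int(h1, 16) ^ int(h2, 16)).count("1")
print(bit_diff)  # roughly 128 of 256 bits differ
```

Two inputs that differ by one character produce digests that disagree in about half their bits, which is exactly what a cryptographic hash is designed to do, and exactly what we don't want here.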

Instead, let's look at the two image representation methods we use in this work.

Perceptual Hashing

Perceptual hashing addresses the main limitation of cryptographic hashing algorithms: the lack of semantic usefulness [5]. We specifically use a perceptual hash based on the discrete cosine transform (DCT), via the phash function from the imagehash library. DCT hashes represent frequency information and should be robust to transformations like resizing, compression, and colour shift.

Let's quickly step through the averaging perceptual hash algorithm to build intuition for how it works:

Perceptual hash example

  1. Shrink the image to a set size (e.g. 8 by 8 pixels).
  2. Convert to greyscale.
  3. Find the average pixel value. Then binarize the image; if a pixel value is above the average, it becomes a 1 (white), if below, a 0 (black).
  4. Flatten the binarized image to get a 64-bit representation.
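The steps above can be sketched directly in NumPy. This is a minimal average hash over a 2D array, assuming the greyscale conversion in step 2 has already happened and the image dimensions divide evenly:

```python
import numpy as np

def average_hash(image, size=8):
    """Average-hash a 2D greyscale array following the steps above."""
    # step 1: shrink to size x size by block-averaging
    h, w = image.shape
    small = image.reshape(size, h // size, size, w // size).mean(axis=(1, 3))
    # step 3: binarize against the mean pixel value (1 = above average)
    bits = (small > small.mean()).astype(int)
    # step 4: flatten to a 64-bit vector
    return bits.flatten()

img = np.arange(64 * 64, dtype=float).reshape(64, 64)  # toy "image"
print(average_hash(img).shape)  # (64,)
```

The phash variant we actually use replaces the raw pixel average with DCT frequency coefficients, but the binarize-and-flatten structure is the same.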

We also show the code for this below, as there are a few subtle post-processing steps we need to get the representations.

import imagehash as ih
import numpy as np
import torch
from torchvision import transforms

# Perceptually hash a torch dataset into an (n, 64) array of bits.
def phash_dataset(train_dataset):
    hashes = []
    for i in range(len(train_dataset)):
        im = train_dataset[i][0]
        if isinstance(im, torch.Tensor):
            im = transforms.functional.to_pil_image(im)
        h = ih.phash(im)                # 8x8 DCT hash
        bits = int(str(h), 16)          # hex string -> 64-bit integer
        # Zero-pad to a fixed 64 bits; bin(...).lstrip("0b") would also
        # strip leading zero bits and yield variable-length strings.
        bitstr = format(bits, "064b")
        hashes.append(np.array(list(bitstr), dtype=int))
    return np.asarray(hashes)

Neural Network Embeddings

MNIST image

Here's a two-dimensional visualization of the embeddings of the MNIST digit recognition dataset. The embeddings themselves are high-dimensional vectors (d = 84 in our example), projected with t-SNE [6] so that cosine-similar images appear close to one another. In this representation (which encodes images by passing them through a neural network), images of the same digit cluster together, suggesting that cosine distance is a meaningful measure of how far apart two images are [7].

To get neural network embeddings, we do a forward pass of the images through a neural network and take the intermediate outputs from the final fully connected layer, as in the figure and corresponding code below.

import numpy as np
import torch.nn as nn
import torch.nn.functional as F

class LeNet5FeatureExtractor(nn.Module):
    """Wraps a trained LeNet5, stopping before the final classification layer."""
    def __init__(self, trained):
        super().__init__()
        self.conv1 = trained.conv1  # nn.Conv2d(1, 6, 5, padding=2)
        self.conv2 = trained.conv2  # nn.Conv2d(6, 16, 5)
        self.fc1   = trained.fc1    # nn.Linear(16*5*5, 120)
        self.fc2   = trained.fc2    # nn.Linear(120, 84)
        # self.fc3 = nn.Linear(84, 10)  # classification head, intentionally omitted

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), (2, 2))
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))     # 84-dimensional embedding
        # The last fully connected layer and softmax are skipped so the
        # network outputs features rather than class probabilities:
        # x = self.fc3(x)
        # return F.log_softmax(x, dim=1)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except batch
        return int(np.prod(size))

LeNet neural network embeddings

Distance Metrics

Due to our choice of image representations, for this blog post we explore cosine and Hamming distance. Let's revisit these metrics.

Cosine Distance. Cosine distance measures the difference in the angle between two vectors (image representations). Because it doesn't consider the magnitude of the difference, it's a popular choice for measuring the distance between two high-dimensional objects.

$$d_{\text{cos}}(x,y) = 1 - \frac{x \cdot y}{\|x\|_2 \, \|y\|_2}$$

Hamming Distance. Hamming distance is defined as the number of corresponding positions in two strings where the symbols differ. In this case, we're counting how many bits are different between the binary representations of the two embeddings.

$$d_{\text{ham}}(x,y) = \sum_i [x_i \neq y_i]$$
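Both metrics take a few lines of NumPy; the vectors below are illustrative:

```python
import numpy as np

def cosine_distance(x, y):
    # 1 - cos(angle between x and y); ignores vector magnitude
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def hamming_distance(x, y):
    # count of positions where the two bit vectors disagree
    return int(np.sum(np.asarray(x) != np.asarray(y)))

# parallel vectors at different scales are cosine-identical
print(cosine_distance(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # ~0.0
print(hamming_distance([1, 0, 1, 1], [1, 1, 1, 0]))                 # 2
```

The magnitude-invariance of cosine distance is visible in the first call: scaling a vector by 2 leaves the distance at zero, which is why it pairs well with high-dimensional embeddings.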


The Toolkit for Auditing Training Data

Now let's see how we can use model outputs and image representations to identify which data points were used in training. Auditing techniques can be split into two categories: passive methods that probe a model without modifying the data itself, and proactive methods that watermark data before it might be used. Since model users may not have access to much compute or a strong technical background, we focus on passive methods that can be applied to a model post-training, including membership inference attacks, private search, and private set intersection.

Membership Inference Attacks (MIAs)

Membership inference attacks rely on the idea of querying a model on a sample (e.g. an image of my famous dog) and comparing its behaviour (loss, confidence) to what you would expect had the model never seen the image. If the model predicts my data suspiciously well, then there's a good chance my data was in the training set. While commonly used as a training data auditing approach, this method has no strong guarantees of correctness and often results in false positives. Moreover, it doesn't solve the training data transformation problem: the MIA might only trigger if we query with the transformed image, not with the original. MIAs are also of limited use on large pre-trained models, often performing no better than random guessing [8].

Private Search

One solution that would provide cryptographic guarantees is private search, which is typically built on hashing or cluster-based indexing. Under the protocol, a party is able to search for and return embeddings under a distance metric without revealing the query to the server [9]. While applicable to our problem, we are interested in private set intersection since it allows us to compare an entire set of images rather than just a single image.

(Fuzzy) Private Set Intersection

Private set intersection is a cryptographic primitive that allows two parties to compute the intersection between their private sets [10]. We propose PSI as a new candidate for auditing training data, as it provides cryptographic guarantees to the auditor on any matches returned, and because it operates at a set level, which suits users with multiple images to audit.

While PSI protocols are typically designed to return exact matches, variants known as fuzzy PSI have also been outlined for privacy-preserving similarity matching, which is more useful for transformations of training data. Fuzzy PSI will return matches if the two images under comparison are within some threshold distance from each other.

Fuzzy PSI diagram

A primer on fuzzy PSI protocols using the distance metrics we consider:

The first distance metric we consider, Hamming distance, already has a fuzzy PSI implementation based on homomorphic encryption [11, 12], but its construction leads to unavoidable, non-negligible false positive rates, which are undesirable when we want exact guarantees on the set intersection. Let's instead focus on cosine distance based PSI [13], which also uses homomorphic encryption. Cosine-PSI better addresses the false positive problem and is shown to support embeddings up to dimension 512 (a realistic embedding size), with communication and computation costs linear in the embedding dimension.


The Setup

Let's overview the players:

  • The model provider holds a set of training data, and is also known as the prover.
  • The data owner (e.g. a publisher or individual) holds a set of data that they want to check. They're also known as the auditor.

The auditor aims to assess whether some transformation of their dataset is contained in the training dataset under a similarity measure, without either party learning about the elements outside their set intersection.

The Non-Private Part

The model provider embeds their training data (𝒯_P) using a publicly known, deterministic embedding model (e.g. LeNet5) or hashing algorithm. They publicly commit to the representation mechanism so that everyone can verify this specific mechanism was used to create the image representations.

$$P = \text{commit}(\text{embed} | \text{hash}(\mathcal{T}_P))$$

When the data owner wishes to query whether their data (𝒯_A) is included in the model provider's dataset, they use the same mechanism to get their own image representations.

$$A = \text{embed} | \text{hash}(\mathcal{T}_A)$$

If the model provider and data owner didn't worry about privacy, they could directly calculate the pairwise distances between the data owner's representations (A) and the training set representations (P). Any pairing under a certain distance threshold would be flagged, identifying which of the data owner's images were used in training.
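This non-private baseline takes only a few lines; the embeddings and threshold below are illustrative:

```python
import numpy as np

def nonprivate_fuzzy_match(A, P, threshold):
    """Return all (auditor_idx, prover_idx) pairs with cosine distance < threshold."""
    A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
    P_n = P / np.linalg.norm(P, axis=1, keepdims=True)
    dist = 1.0 - A_n @ P_n.T          # pairwise cosine distances
    return np.argwhere(dist < threshold)

P = np.array([[1.0, 0.0], [0.0, 1.0]])   # training set representations
A = np.array([[0.9, 0.1]])               # auditor representation, close to P[0]
print(nonprivate_fuzzy_match(A, P, 0.1))  # [[0 0]]
```

Of course, running this directly would require one party to hand its representations to the other, which is exactly the privacy leak we want to avoid.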

The Private Part

To keep P and A private, we use fuzzy PSI. We want to return the matches under the threshold without either party learning anything else. To do this, we follow the protocol outlined in [13].

The protocol uses CKKS, a fully homomorphic encryption (FHE) scheme, for private computation of similarity scores over our embedding data.

The way to think about FHE is as a cryptographic technique that allows computations to be performed over encrypted data (ciphertexts) without decrypting them first. Only the party who encrypted the data can reveal the plaintext result.

  1. The prover and auditor first agree on a predefined threshold t, which could be selected from a set of representative public data.

  2. The auditor encrypts their plaintext embedding with a public key pk to a ciphertext and submits it to the prover.

    $$\bar{A} = \text{Encrypt}_{\text{CKKS}}(A, pk)$$

  3. The prover computes a private plaintext-ciphertext multiplication (the most costly computation) to get unnormalized cosine similarities:

    Private cosine computation

    $$s = \bar{A}^T \cdot P$$

  4. The prover evaluates an encrypted approximate sign function against the predefined threshold t. This indicates which elements pass the threshold without revealing the similarity scores themselves, and zeros out the rest.

    $$\bar{s} = \text{Threshold}_{\text{CKKS}}(s, t)$$

  5. To keep the prover's inputs private, the prover applies noise flooding to the outputs, sampling Gaussian noise proportional to the cryptographic error, e ← 𝒩(0, σ²):

    $$\bar{z} = \bar{s} + e$$

    and returns the noised outputs to the auditor.

  6. The auditor now decrypts the output with their secret key sk to get an indicator of whether each data point in their audit set has a match under the threshold.

    $$z = \text{Decrypt}_{\text{CKKS}}(\bar{z}, sk)$$

The key idea here is that each party only needs to keep their own inputs private, which helps us optimize the cryptographic protocol.
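To make the dataflow concrete, here's a plaintext simulation of steps 2 through 6. This is only a sketch of the protocol's logic, not a secure implementation: Encrypt and Decrypt are identity stand-ins, the approximate sign is an exact comparison, and t is treated as a similarity threshold that matches must exceed:

```python
import numpy as np

def simulate_fuzzy_psi(A, P, t, sigma=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    A_bar = A                             # step 2: auditor "encrypts" (identity here)
    s = A_bar @ P.T                       # step 3: unnormalized cosine similarities
    s_bar = (s > t).astype(float)         # step 4: thresholded sign indicator
    z_bar = s_bar + rng.normal(0.0, sigma, s_bar.shape)  # step 5: noise flooding
    z = z_bar                             # step 6: auditor "decrypts"
    return (z > 0.5).any(axis=1)          # match indicator per audited image

P = np.array([[1.0, 0.0], [0.0, 1.0]])        # prover's (normalized) embeddings
A = np.array([[0.99, 0.14], [-1.0, 0.0]])     # auditor's embeddings
print(simulate_fuzzy_psi(A, P, t=0.9))        # [ True False]
```

In the real protocol, steps 3 to 5 all happen over CKKS ciphertexts, so the prover never sees A and the auditor never sees the individual similarity scores.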


So Does It Work?

Basically, no... but it could!

Let's start by evaluating whether the non-private version of our system works. Can we successfully identify training images in datasets? Only partially. We test this by comparing every pair of images across three sets:

  1. The main set of images.
  2. The transformed version of the main set.
  3. A completely disjoint set of images.

Distance results

We can visualize these results in the above figure. Each graph compares every pair of images between two sets; matching the nth image in one set to the nth image in the other should create a low-distance (blue) diagonal.

The first column compares images in the main set with each other, and we should see a crisp diagonal line (as with the Hamming distance and neural network representations).

The second column compares the main set images with their transformed versions, and we should also see the diagonal. Here it's much fainter; our representations and distances struggle to detect the transformed version of an image. We have some success with neural network embeddings and cosine distance, but the false positive rate for this system would be quite high.

The third column compares the main set to the disjoint set. Ideally, the graph here should look mostly yellow as the images are relatively unrelated. Unfortunately, many of the images in MNIST look quite similar, so their representations are close.

As you can tell, none of the combinations of image representations and distances work the way we need them to. This is partly due to the specific transformations used; rotations, for example, pose a big challenge for our representations. Think of rotating a 6 to look like a 9, for example.

One direction that may lead to better results is contrastive learning, where a neural network is trained specifically to embed an image and its transformations close together. In general, this problem seems understudied: there are many image representation works that use NNs, but relatively few train NNs to recognize transformed versions of images. Most NN representation spaces instead embed semantically similar images together, which, while useful for a specific task, is not quite what our auditing goal requires.

PSI works! Fuzzy-PSI methods are able to adequately compare image representations without leaking private information. In the table below we show the prover and auditor computation times for comparing 32 or 64 images against a 60,000-image training set. Note that these times were measured on a standard Apple M4 laptop, meaning the times and computation costs are quite realistic for low-compute auditors and high-compute model owners.

Results for timing


Where Do We Go From Here?

Clearly, there's still some work to be done, but the big takeaway is that PSI could indeed be an avenue for training dataset auditing.

The next step is malicious security—how can we ensure that each party computed their parts of the algorithm correctly and honestly? To get this property, we would need to design a verifiable computation protocol for the entire process. It would also be essential to know that the training set being audited is the same set used to train the company's model. Otherwise, the company could conduct a switching attack: allowing one training set to be audited while using a different set containing people's information to train their model.

An easier version of this project that is still worth further research would be using PSI to detect exact image matches. To do this, we'd need the company to share their transformations so the auditor can replicate them on their own images and compare the two transformed sets.
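In this exact-match setting, once both parties apply the same transformations and hash the results, the (non-private) intersection is just a set operation; PSI would compute the same intersection without revealing the non-matching hashes. The image identifiers below are placeholders:

```python
import hashlib

def digest(data: bytes) -> str:
    # stand-in for hashing a fully transformed image's bytes
    return hashlib.sha256(data).hexdigest()

training_hashes = {digest(b"img-001"), digest(b"img-002")}
audit_hashes = {digest(b"img-002"), digest(b"img-999")}

matches = training_hashes & audit_hashes   # exact matches only
print(len(matches))  # 1
```

Because exact matching needs no fuzzy threshold, classical PSI protocols apply directly, avoiding the false positive issues of the fuzzy variants.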

Finally, we need more work in data governance to determine what the parties should do if there is an intersection. Much of this is work for regulators, and currently beyond the scope of our work.

Thank you for reading! If you have any questions, please reach out to us at angela.sha@mail.utoronto.ca and sierra.wyllie@mail.utoronto.ca.


References

  1. Oquab, M., et al. (2024). DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint arXiv:2304.07193. https://arxiv.org/abs/2304.07193

  2. Deitke, M., et al. (2024). Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. arXiv preprint arXiv:2409.17146. https://arxiv.org/abs/2409.17146

  3. Mu, N., & Gilmer, J. (2019). MNIST-C: A Robustness Benchmark for Computer Vision. arXiv preprint arXiv:1906.02337. https://arxiv.org/abs/1906.02337

  4. Hendrycks, D., & Dietterich, T. (2019). Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. arXiv preprint arXiv:1903.12261. https://arxiv.org/abs/1903.12261

  5. Venkatesan, R., Koon, S.-M., Jakubowski, M.H., & Moulin, P. (2000). Robust image hashing. In Proceedings 2000 International Conference on Image Processing (Vol. 3, pp. 664-666). IEEE.

  6. van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605. http://www.jmlr.org/papers/v9/vandermaaten08a.html

  7. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations. arXiv preprint arXiv:2002.05709. https://arxiv.org/abs/2002.05709

  8. Zhang, J., Das, D., Kamath, G., & Tramèr, F. (2025). Membership Inference Attacks Cannot Prove that a Model Was Trained On Your Data. arXiv preprint arXiv:2409.19798. https://arxiv.org/abs/2409.19798

  9. Li, J., et al. (2025). Panther: Private Approximate Nearest Neighbor Search in the Single Server Setting. In Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security (pp. 365-379). ACM.

  10. Freedman, M.J., Nissim, K., & Pinkas, B. (2004). Efficient Private Matching and Set Intersection. In Advances in Cryptology - EUROCRYPT 2004 (pp. 1-19). Springer.

  11. Chakraborti, A., Fanti, G., & Reiter, M.K. (2023). Distance-Aware Private Set Intersection. In 32nd USENIX Security Symposium (pp. 319-336). USENIX Association.

  12. Uzun, E., Chung, S.P., Kolesnikov, V., Boldyreva, A., & Lee, W. (2021). Fuzzy Labeled Private Set Intersection with Applications to Private Real-Time Biometric Search. In 30th USENIX Security Symposium (pp. 911-928). USENIX Association.

  13. Son, H., et al. (2025). Doubly Efficient Fuzzy Private Set Intersection for High-dimensional Data with Cosine Similarity. Cryptology ePrint Archive, Paper 2025/054. https://eprint.iacr.org/2025/054
