Secure and Robust Collaborative Learning
Machine learning algorithms continue to achieve remarkable success in a wide range of applications. These advancements are possible, in part, due to the availability of large domain-specific datasets for training machine learning models. Hence, there are expanding efforts to collect more representative data to train models for new applications. This raises serious concerns regarding the privacy and security of the collected data. The privacy ramifications of massive data collection in the machine learning landscape have led both industry and academia to work on alternative, privacy-preserving paradigms for machine learning.
Decentralized, secure machine learning techniques have emerged as a compelling alternative for sensitive applications in the last few years. These paradigms eliminate the need to pool data centrally for ML training and thus ensure data sovereignty and alleviate the risks associated with the large-scale collection of sensitive data. Although they provide many privacy benefits, these systems sacrifice robustness: an actively malicious participant can influence the model's behavior (e.g., by inserting backdoors) in these settings without being detected. As these systems are deployed in practice for a range of sensitive applications, their robustness is growing in importance. In this project, we investigate how to better understand and alleviate integrity issues in secure decentralized learning paradigms.
• RoFL (published at IEEE S&P’23): Federated Learning (FL), though it offers many privacy benefits, amplifies ML robustness issues by exposing the learning process to an active attacker that can potentially be present throughout the entire training process and fine-tune its attacks to the current state of the model. Even though we have recently seen many attacks exposing severe vulnerabilities in FL, we still lack a holistic understanding of what enables these attacks and how they can be mitigated effectively. In this work, we demystify the inner workings of existing attacks. We provide new insights into why these attacks are possible and why a definitive solution to FL robustness is challenging. We show that the inherent need for ML algorithms to memorize tail subpopulations has significant implications for ML integrity. This phenomenon has largely been studied in the context of privacy; in our analysis, we shed light on its implications for ML robustness. In addition, we show that constraints on client updates can effectively improve robustness against some severe attacks (see the sketch after this list). To incorporate these constraints into secure FL protocols, we design and develop RoFL, a new secure FL system that enables constraints to be expressed and enforced on high-dimensional encrypted model updates. At its core, RoFL augments existing secure FL aggregation protocols with zero-knowledge proofs. Due to the scale of FL, realizing these checks efficiently is a major challenge. We introduce several optimizations at the ML layer that reduce the number of cryptographic checks needed while preserving the effectiveness of our defenses. We build RoFL and show that it scales to the model sizes used in real FL deployments.
• CAMEL: Cryptographic Audits for Machine Learning: To date, no compelling solution exists that fully addresses the robustness of secure decentralized learning paradigms. Verifiable computation and cryptographic constraints can alleviate some severe integrity issues. However, these techniques have limited effectiveness in mitigating integrity attacks that exploit memorization. So far, this type of exploitation cannot be entirely prevented without impairing the model's ability to learn from less-representative data points. As the robustness of these learning paradigms remains an open challenge, these systems need to be augmented with accountability. Towards this, we are working on cryptographic audits of decentralized secure learning systems that allow the sources of integrity issues to be traced in a privacy-preserving way.
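To make the constraint idea in the RoFL item above concrete, the following is a minimal plaintext sketch of an L2-norm bound enforced on client updates before averaging (Python with NumPy; the function names clip_update and aggregate are illustrative, not part of RoFL). In RoFL itself, updates remain encrypted and clients prove in zero-knowledge that their update satisfies the bound; this sketch only illustrates the underlying defense principle.

    import numpy as np

    def clip_update(update, bound):
        """Project a client's model update onto an L2 ball of radius `bound`."""
        norm = np.linalg.norm(update)
        return update if norm <= bound else update * (bound / norm)

    def aggregate(updates, bound):
        """Average client updates after enforcing the norm constraint on each one."""
        return np.mean([clip_update(u, bound) for u in updates], axis=0)

    # Toy round: nine honest clients and one client submitting a heavily scaled
    # update, as in model-replacement / backdoor attacks.
    rng = np.random.default_rng(0)
    honest = [rng.normal(0.0, 0.1, size=10) for _ in range(9)]
    malicious = [np.full(10, 50.0)]
    aggregated = aggregate(honest + malicious, bound=1.0)

Bounding the norm of each update caps the influence any single participant can exert on the aggregated model, which is what curbs heavily scaled, model-replacement-style attacks; RoFL's contribution is enforcing such checks efficiently over high-dimensional encrypted updates.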
Publications
Artemis: Efficient Commit-and-Prove SNARKs for zkML Paper Github
Hidde Lycklama*, Alexander Viand*, Nikolay Avramov, Nicolas Küchler, Anwar Hithnawi
Preprint, arXiv:2409.12055.
UTrace: Poisoning Forensics for Private Collaborative Learning Paper
Evan Rose, Hidde Lycklama, Harsh Chaudhari, Anwar Hithnawi, Alina Oprea
Preprint, arXiv:2409.15126.
Holding Secrets Accountable: Auditing Privacy-Preserving Machine Learning Paper Slides Github Video
Hidde Lycklama, Alexander Viand, Nicolas Küchler, Christian Knabenhans, Anwar Hithnawi
USENIX Security 2024.
RoFL: Robustness of Secure Federated Learning Paper Slides Github Video
Hidde Lycklama*, Lukas Burkhalter*, Alexander Viand, Nicolas Küchler, Anwar Hithnawi
IEEE Security and Privacy (Oakland) 2023.
VF-PS: How to Select Important Participants in Vertical Federated Learning, Efficiently and Securely? Paper
Jiawei Jiang, Lukas Burkhalter, Fangcheng Fu, Bolin Ding, Bo Du, Anwar Hithnawi, Bo Li, Ce Zhang
NeurIPS (Spotlight) 2022.
Cryptographic Auditing for Collaborative Learning Paper Poster
Hidde Lycklama, Nicolas Küchler, Alexander Viand, Emanuel Opel, Lukas Burkhalter, Anwar Hithnawi
ML Safety Workshop at NeurIPS 2022.