Much current research in data science involves machine learning (ML) models trained on data sourced from large numbers of individuals, whose awareness of, consent to, and understanding of the research goals can vary widely. Researchers therefore have a responsibility to protect the confidentiality and privacy of the people whose data they process. At the same time, sharing both data and trained models drives scientific advancement and promotes the important social goal of open and transparent science.
Local and international regulations such as the General Data Protection Regulation (GDPR) and the EU’s policy on trustworthy AI also establish legal duties and principles for privacy protection that the following tools may help researchers meet.
Sharing data with privacy
Training a complex ML model often requires a very large amount of data, more than a single researcher or organisation could feasibly generate. Sharing our data not only makes our research more reproducible, but also promotes advancement of the field as a whole. However, it poses the risk of inadvertently sharing personal information that could be used to identify a subject.
Most researchers will remove uniquely identifying information (such as ID numbers, addresses, and phone numbers) before publication, but research has shown that, with access to secondary datasets, such ‘pseudonymised’ datasets can still be traced back to the individual (Narayanan & Shmatikov, 2008; Sun et al., 2012).
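To see why removing direct identifiers is not enough on its own, consider the minimal pandas sketch below (all column names and values are hypothetical): the direct identifiers are dropped, yet the quasi-identifiers that remain can still single a person out when joined against a secondary dataset.

```python
import pandas as pd

# Hypothetical participant table (illustrative column names and values).
df = pd.DataFrame({
    "participant_id": [101, 102, 103],
    "phone": ["555-0101", "555-0102", "555-0103"],
    "zip_code": ["02139", "02139", "94305"],
    "birth_year": [1981, 1990, 1975],
    "outcome": [1, 0, 1],
})

# Pseudonymisation: drop the directly identifying columns before release.
released = df.drop(columns=["participant_id", "phone"])

# The remaining quasi-identifiers (zip_code + birth_year) may still be
# unique per person, so joining against a secondary dataset that shares
# them can re-identify individuals, as in the attacks cited above.
print(released)
```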
Differential privacy
Differential privacy is a statistical framework for quantifying the risk that a released result uniquely identifies a member of a dataset; calibrated noise can then be added to keep that risk below a chosen privacy budget (Yang et al., 2012).
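As a concrete illustration, here is a minimal sketch of the Laplace mechanism, the classic way to release a counting query with differential privacy (the dataset and the choice of ε below are purely illustrative):

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, rng=None) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person
    changes it by at most 1), so Laplace noise with scale 1/epsilon
    satisfies the epsilon-DP guarantee.
    """
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: release how many participants are over 65, with epsilon = 0.5.
ages = np.array([23, 45, 67, 71, 34, 88, 52])
print(laplace_count(int((ages > 65).sum()), epsilon=0.5))
```

Smaller values of ε give stronger privacy at the cost of noisier answers.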
Synthetic data generation
If sharing the original data raises privacy or ethical concerns, we can still contribute useful information by sharing synthetic datasets that reproduce statistical features of the original dataset without exposing actual instances (Torfi et al., 2020).
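Practical generators use models such as the convolutional GANs of Torfi et al. (2020), but the core idea can be sketched far more simply: fit a distribution to the real data and sample fresh records from it. The numpy sketch below (toy data) preserves only column means and covariances:

```python
import numpy as np

def synthesize_gaussian(real: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Sample synthetic records from a multivariate normal fitted to the
    real data: column means and covariances are reproduced, but no
    actual record is released."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(real.mean(axis=0),
                                   np.cov(real, rowvar=False),
                                   size=n_samples)

real = np.random.default_rng(1).normal(size=(500, 3))  # stand-in for sensitive data
synthetic = synthesize_gaussian(real, n_samples=500)
print(real.mean(axis=0), synthetic.mean(axis=0))       # closely matched
```

A generator fitted too closely to the real data can itself leak information, which is why Torfi et al. (2020) combine generation with differential privacy.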
Useful resources
- Individual Risk Tool. A useful visualisation of how a few pieces of information can uniquely identify you.
- The Algorithmic Foundations of Differential Privacy. The foundational text on the study of differential privacy.
- What is privacy-preserving synthetic data? A straightforward introduction to the concept of synthetic data.
Learning with privacy
Beyond sharing data with other researchers, we can also share our trained models, or make them available as a service: carrying out predictions on data provided by others without the need for them to invest time and resources in training their own systems. However, this sharing can also carry risks for personal privacy. For instance, many ML solutions require users to send personal data to a central server for processing, exposing it to the risk of interception or misuse. The model itself may also retain sequences from the training data that we never intended it to keep, a phenomenon referred to as unintended memorization (Carlini et al., 2019). This can be particularly harmful in models trained on large amounts of user-created text (Brown et al., 2022).
Federated learning
Federated learning is a design paradigm in which users’ data never leaves their own devices: training is broken down into a set of computations that take place on the edge, and only model updates are sent back to a central coordinator (Kairouz et al., 2019).
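The coordination step is straightforward to sketch. The toy federated averaging round below (numpy; the model, data, and learning rate are all illustrative) has each ‘client’ improve the weights on data that stays local, so the coordinator only ever sees weight updates:

```python
import numpy as np

def local_update(w, X, y, lr=0.1):
    """One gradient step of least-squares regression, computed on the
    client device, so the raw (X, y) never leaves it."""
    return w - lr * X.T @ (X @ w - y) / len(y)

def federated_round(w, clients):
    """Average the clients' locally updated weights, weighted by how
    much data each client holds; only weights cross the network."""
    updates = [local_update(w, X, y) for X, y in clients]
    sizes = [len(y) for _, y in clients]
    return np.average(updates, axis=0, weights=sizes)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(5):  # five simulated devices, each with its own private data
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + rng.normal(scale=0.1, size=50)))

w = np.zeros(2)
for _ in range(100):
    w = federated_round(w, clients)
print(w)  # approaches true_w without raw data ever leaving a 'client'
```

Even the updates can leak information about the underlying data, so federated learning is typically combined with secure aggregation or differential privacy (Kairouz et al., 2019).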
Adversarial learning
We can also draw on research in adversarial and cross-domain training to teach models to discard undesirable information by directly controlling the training process (Coavoux et al., 2018). This approach extends beyond private attributes to the elimination of unwanted biases (Zhang et al., 2018).
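One common implementation is the gradient-reversal trick, sketched below in PyTorch (network sizes and data are toy placeholders; this is a generic illustration rather than the exact setup of the cited papers). An adversary learns to predict the private attribute from the shared representation, while the reversed gradient trains the encoder to strip out exactly that information:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the
    backward pass, so the encoder is pushed to make the adversary fail."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

encoder = nn.Sequential(nn.Linear(10, 8), nn.ReLU())  # shared representation
task_head = nn.Linear(8, 2)   # the prediction we actually want
adversary = nn.Linear(8, 2)   # tries to recover the private attribute
params = (list(encoder.parameters()) + list(task_head.parameters())
          + list(adversary.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 10)                 # toy batch
y_task = torch.randint(0, 2, (32,))     # main-task labels
y_private = torch.randint(0, 2, (32,))  # private attribute to be hidden

for _ in range(100):
    opt.zero_grad()
    z = encoder(x)
    # The task loss makes z useful; the reversed adversary loss makes z
    # uninformative about the private attribute.
    loss = (loss_fn(task_head(z), y_task)
            + loss_fn(adversary(GradReverse.apply(z)), y_private))
    loss.backward()
    opt.step()
```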
Differential privacy
Differential privacy has also seen significant use during model training itself: adding small amounts of calibrated statistical noise to the training updates reduces the risk of the model learning any individual data point too well (Boulemtafes et al., 2020; Feyisetan et al., 2020).
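The best-known recipe is DP-SGD: clip each example’s gradient so that no single person can dominate an update, then add Gaussian noise scaled to that clipping bound. A minimal numpy sketch for logistic regression follows (hyperparameters are illustrative, and a real implementation would also track the cumulative privacy budget):

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_mult=1.0, rng=None):
    """One differentially private SGD step for logistic regression:
    per-example gradients are clipped to norm <= clip, summed, and
    perturbed with Gaussian noise proportional to that bound."""
    rng = rng or np.random.default_rng()
    total = np.zeros_like(w)
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + np.exp(-xi @ w))                  # predicted probability
        g = (p - yi) * xi                                  # per-example gradient
        g *= min(1.0, clip / (np.linalg.norm(g) + 1e-12))  # bound each person's influence
        total += g
    total += rng.normal(scale=noise_mult * clip, size=w.shape)
    return w - lr * total / len(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
w = np.zeros(3)
for _ in range(200):
    w = dp_sgd_step(w, X, y, rng=rng)
print(w)  # a noisy, privacy-preserving estimate of the decision boundary
```

Libraries such as Opacus (for PyTorch) and TensorFlow Privacy provide production implementations together with the accompanying privacy accounting.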
Useful resources
- Privacy in Deep Learning: A Survey. A useful, brief overview of some extant threats and mitigation strategies.
- Deep learning and differential privacy. Frank McSherry’s thought-provoking blog post about the privacy research landscape.
- Privacy Preserving Machine Learning: Maintaining confidentiality and preserving trust. A recent overview from Microsoft Research of privacy-preserving learning.
- PySyft. A federated learning and privacy-preservation library designed for compatibility with the major machine learning frameworks PyTorch and TensorFlow.
- Narayanan, A., & Shmatikov, V. (2008). Robust De-Anonymization of Large Sparse Datasets. Proceedings - IEEE Symposium on Security and Privacy, 111–125. https://doi.org/10.1109/SP.2008.33
- Sun, X., Wang, H., & Zhang, Y. (2012). On the Identity Anonymization of High-Dimensional Rating Data. Concurrency and Computation: Practice and Experience, 24, 1108–1122. https://doi.org/10.1002/cpe.1724
- Yang, Y., Zhang, Z., Miklau, G., Winslett, M., & Xiao, X. (2012). Differential Privacy in Data Publication and Analysis. Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 601–606. https://doi.org/10.1145/2213836.2213910
- Torfi, A., Fox, E. A., & Reddy, C. K. (2020). Differentially Private Synthetic Medical Data Generation Using Convolutional GANs. arXiv:2012.11774 [cs].
- Carlini, N., Liu, C., Erlingsson, Ú., Kos, J., & Song, D. (2019). The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks. 28th USENIX Security Symposium (USENIX Security 19).
- Brown, H., Lee, K., Mireshghallah, F., Shokri, R., & Tramèr, F. (2022). What Does It Mean for a Language Model to Preserve Privacy? arXiv:2202.05520 [cs, stat].
- Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., D’Oliveira, R. G. L., Rouayheb, S. E., Evans, D., Gardner, J., Garrett, Z., Gascón, A., Ghazi, B., Gibbons, P. B., Gruteser, M., … Zhao, S. (2019). Advances and Open Problems in Federated Learning. arXiv:1912.04977 [cs].
- Coavoux, M., Narayan, S., & Cohen, S. B. (2018). Privacy-Preserving Neural Representations of Text. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 1–10.
- Zhang, B. H., Lemoine, B., & Mitchell, M. (2018). Mitigating Unwanted Biases with Adversarial Learning. Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, 335–340.
- Boulemtafes, A., Derhab, A., & Challal, Y. (2020). A Review of Privacy-Preserving Techniques for Deep Learning. Neurocomputing, 384, 21–45. https://doi.org/10.1016/j.neucom.2019.11.041
- Feyisetan, O., Balle, B., Drake, T., & Diethe, T. (2020). Privacy- and Utility-Preserving Textual Analysis via Calibrated Multivariate Perturbations. WSDM 2020 - Proceedings of the 13th International Conference on Web Search and Data Mining, 178–186. https://doi.org/10.1145/3336191.3371856