
Much current research in data science involves machine learning (ML) models interacting with data sourced from large numbers of individuals, who vary widely in their awareness of, consent to, and understanding of the research goals. Researchers therefore have a responsibility to protect the confidentiality and privacy of the people whose data they process. At the same time, sharing both data and trained models drives scientific advancement and supports the important social goals of open and transparent science.

It is also worth noting that local and international regulations, such as the EU’s General Data Protection Regulation (GDPR) and its policy on trustworthy AI, establish legal duties and principles for privacy protection that the following tools can help researchers meet.

Sharing data with privacy

Training a complex ML model often requires more data than a single researcher or organisation could feasibly generate. Sharing our data not only makes our research more reproducible but also advances the field as a whole. However, it carries the risk of inadvertently disclosing personal information that could be used to identify a participant.

Most researchers remove uniquely identifying information (such as ID numbers, addresses, and phone numbers) before publication, but research has shown that, with access to secondary datasets, such ‘pseudonymised’ datasets can still be traced back to individuals (Narayanan & Shmatikov, 2008; Sun et al., 2012).
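To see how such a linkage attack works, the toy sketch below joins a ‘pseudonymised’ study dataset to a public dataset on shared quasi-identifiers (postcode, birth year, sex), re-attaching names to sensitive records. All names, columns, and values here are hypothetical.

```python
import pandas as pd

# A 'pseudonymised' study dataset: direct identifiers removed, but
# quasi-identifiers (postcode, birth year, sex) remain.
study = pd.DataFrame({
    "postcode": ["10115", "10117", "20095"],
    "birth_year": [1980, 1975, 1990],
    "sex": ["F", "M", "F"],
    "diagnosis": ["asthma", "diabetes", "migraine"],  # sensitive attribute
})

# A hypothetical public dataset (e.g. an electoral roll) with names.
public = pd.DataFrame({
    "name": ["A. Schmidt", "B. Meyer", "C. Fischer"],
    "postcode": ["10115", "10117", "20095"],
    "birth_year": [1980, 1975, 1990],
    "sex": ["F", "M", "F"],
})

# Joining on the quasi-identifiers re-attaches names to sensitive
# records, even though the study data contains no direct identifiers.
reidentified = public.merge(study, on=["postcode", "birth_year", "sex"])
print(reidentified[["name", "diagnosis"]])
```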

Differential privacy

Differential privacy is a statistical framework for quantifying the risk of uniquely identifying a member of a dataset; calibrated noise can then be added to published statistics to ensure that privacy is preserved (Yang et al., 2012).
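As a concrete illustration, the sketch below applies the classic Laplace mechanism to a counting query. The data, query, and privacy parameter epsilon are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_count(data, predicate, epsilon):
    """Release a count with epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1
    (sensitivity = 1), so Laplace noise of scale 1/epsilon suffices.
    """
    true_count = sum(predicate(x) for x in data)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [34, 45, 29, 61, 52, 38]  # hypothetical survey responses
# Smaller epsilon means more noise and a stronger privacy guarantee.
print(dp_count(ages, lambda a: a > 40, epsilon=0.5))
```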

Synthetic data generation

If sharing the original data raises privacy or ethical concerns, we can still contribute useful information by sharing synthetic datasets that reproduce statistical features of the original dataset without exposing actual instances (Torfi et al., 2020).
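As a deliberately simple illustration of the idea, the sketch below fits a multivariate normal distribution to the numeric columns of a dataset and samples new records from it. Production approaches, such as the differentially private GANs of Torfi et al. (2020), are considerably more sophisticated; the data and column names here are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in for a real (private) dataset.
real = pd.DataFrame({
    "age": rng.normal(45, 12, 500),
    "systolic_bp": rng.normal(125, 15, 500),
})

# Fit the mean and covariance of the real data ...
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

# ... and sample synthetic records that share those statistics but
# correspond to no actual individual.
synthetic = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=500),
    columns=real.columns,
)
print(synthetic.describe())
```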


Learning with privacy

Beyond sharing data with other researchers, we can also share our trained models, or make them available as a service, carrying out predictions on data provided by others without them needing to invest time and resources in training their own systems. However, this sharing also carries risks for personal privacy. For instance, many ML services require users to send personal data to a central server for processing, exposing it to the risk of interception or misuse. The model itself may also memorize sequences from its training data that we do not wish it to retain, a phenomenon referred to as unintended memorization (Carlini et al., 2019). This can be particularly harmful for models trained on large amounts of user-created text (Brown et al., 2022).
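One way to probe for unintended memorization, in the spirit of the ‘exposure’ metric of Carlini et al. (2019), is to insert a canary string into the training data and then check whether the trained model ranks it as far more likely than comparable random strings. The sketch below outlines the measurement; `model_loss` is a hypothetical stand-in for the trained model’s per-sequence loss and is replaced by a dummy here.

```python
import numpy as np

rng = np.random.default_rng(0)

def model_loss(sequence):
    # Hypothetical: return the trained model's negative log-likelihood
    # for the sequence. A random dummy is used here for illustration.
    return rng.uniform(5.0, 10.0)

# Canary inserted into the training data, plus random same-format candidates.
canary = "my secret code is 123456"
candidates = [f"my secret code is {rng.integers(10**6):06d}" for _ in range(999)]

# Rank the canary's loss among the candidates: a memorized canary ranks
# near the top (lowest loss), giving high exposure in bits.
losses = np.array([model_loss(c) for c in candidates])
rank = int((losses < model_loss(canary)).sum()) + 1
exposure = np.log2(len(candidates) + 1) - np.log2(rank)
print(f"canary rank {rank}/{len(candidates) + 1}, exposure {exposure:.2f} bits")
```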

Federated learning

Federated learning is a design paradigm in which users’ data never leaves their own devices: training is broken down into computations that run at the edge, and only model updates are sent back to a central coordinator (Kairouz et al., 2019).
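The sketch below illustrates the core federated averaging (FedAvg) loop on a toy linear regression problem: each client runs local updates on its private data and sends back only model parameters, which the coordinator averages. Real systems add secure aggregation, client sampling, and communication efficiency on top; all data here is simulated.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, steps=10):
    """One client's local gradient steps on its own private data."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Simulate three clients, each holding data that never leaves the device.
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + rng.normal(scale=0.1, size=50)))

global_w = np.zeros(2)
for _ in range(20):
    # Clients train locally and send back only their updated weights;
    # the coordinator averages them (weighting clients equally here).
    updates = [local_update(global_w, X, y) for X, y in clients]
    global_w = np.mean(updates, axis=0)

print(global_w)  # approaches true_w without centralising any raw data
```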

Adversarial learning

We can also draw on research in cross-domain training to teach models to ignore undesirable data by directly controlling the training process (Coavoux et al., 2018). This can be extended beyond private attributes to the elimination of unwanted biases (Zhang et al., 2018).
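A common mechanism for this is a gradient reversal layer: an adversary learns to predict the private attribute from the model’s internal representation, while the reversed gradient trains the encoder to remove that information. The PyTorch sketch below is a minimal illustration; all shapes, data, and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates the gradient on the
    backward pass, so the encoder is trained to *hurt* the adversary."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

encoder = nn.Sequential(nn.Linear(32, 16), nn.ReLU())
task_head = nn.Linear(16, 2)  # the prediction we actually want
adversary = nn.Linear(16, 2)  # tries to recover the private attribute

params = (list(encoder.parameters()) + list(task_head.parameters())
          + list(adversary.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 32)                 # hypothetical input features
y_task = torch.randint(0, 2, (64,))     # task labels
y_private = torch.randint(0, 2, (64,))  # private attribute labels

for _ in range(100):
    z = encoder(x)
    task_loss = loss_fn(task_head(z), y_task)
    # The adversary learns to predict the private attribute, while the
    # reversed gradient pushes the encoder to erase it from z.
    adv_loss = loss_fn(adversary(GradReverse.apply(z)), y_private)
    opt.zero_grad()
    (task_loss + adv_loss).backward()
    opt.step()
```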

Differential privacy

Differential privacy has also seen significant use as a technique for preserving privacy during model training, reducing the risk of the model learning individual data points too well by adding small amounts of statistical noise during training (Boulemtafes et al., 2020; Feyisetan et al., 2020).
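The best-known training-time technique is DP-SGD: clip each example’s gradient so no individual record can dominate an update, then add calibrated Gaussian noise before averaging. The numpy sketch below shows the core update for logistic regression; the hyperparameters are illustrative, and a real implementation would also track the cumulative privacy budget across training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data for a binary classifier.
X = rng.normal(size=(200, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)

w = np.zeros(5)
clip_norm, noise_mult, lr = 1.0, 1.1, 0.1  # illustrative hyperparameters

for _ in range(200):
    preds = 1 / (1 + np.exp(-X @ w))
    # Per-example gradients, clipped so no single record dominates.
    per_example = (preds - y)[:, None] * X
    norms = np.maximum(
        np.linalg.norm(per_example, axis=1, keepdims=True), 1e-12)
    clipped = per_example * np.minimum(1.0, clip_norm / norms)
    # Gaussian noise calibrated to the clipping norm masks any one
    # individual's contribution to the update.
    noise = rng.normal(scale=noise_mult * clip_norm, size=5)
    w -= lr * (clipped.sum(axis=0) + noise) / len(X)
```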


References
  1. Narayanan, A., & Shmatikov, V. (2008). Robust De-Anonymization of Large Sparse Datasets. Proceedings - IEEE Symposium on Security and Privacy, 111–125. 10.1109/SP.2008.33
  2. Sun, X., Wang, H., & Zhang, Y. (2012). On the Identity Anonymization of High-Dimensional Rating Data. Concurrency Computation Practice and Experience, 24, 1108–1122. 10.1002/cpe.1724
  3. Yang, Y., Zhang, Z., Miklau, G., Winslett, M., & Xiao, X. (2012). Differential Privacy in Data Publication and Analysis. Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 601–606. 10.1145/2213836.2213910
  4. Torfi, A., Fox, E. A., & Reddy, C. K. (2020). Differentially Private Synthetic Medical Data Generation Using Convolutional GANs. arXiv:2012.11774 [cs].
  5. Carlini, N., Liu, C., Erlingsson, Ú., Kos, J., & Song, D. (2019). The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks. Proceedings of the 28th USENIX Security Symposium, 267–284.
  6. Brown, H., Lee, K., Mireshghallah, F., Shokri, R., & Tramèr, F. (2022). What Does It Mean for a Language Model to Preserve Privacy? arXiv:2202.05520 [cs, stat].
  7. Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., D’Oliveira, R. G. L., Rouayheb, S. E., Evans, D., Gardner, J., Garrett, Z., Gascón, A., Ghazi, B., Gibbons, P. B., Gruteser, M., … Zhao, S. (2019). Advances and Open Problems in Federated Learning. arXiv:1912.04977 [cs].
  8. Coavoux, M., Narayan, S., & Cohen, S. B. (2018). Privacy-Preserving Neural Representations of Text. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 1–10.
  9. Zhang, B. H., Lemoine, B., & Mitchell, M. (2018). Mitigating Unwanted Biases with Adversarial Learning. Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, 335–340.
  10. Boulemtafes, A., Derhab, A., & Challal, Y. (2020). A Review of Privacy-Preserving Techniques for Deep Learning. Neurocomputing, 384, 21–45. 10.1016/j.neucom.2019.11.041
  11. Feyisetan, O., Balle, B., Drake, T., & Diethe, T. (2020). Privacy- And Utility-Preserving Textual Analysis via Calibrated Multivariate Perturbations. WSDM 2020 - Proceedings of the 13th International Conference on Web Search and Data Mining, 178–186. 10.1145/3336191.3371856