Why do we need Privacy Preserving Machine Learning?

Why is privacy-preserving machine learning gaining popularity? And how can organizations take advantage of it?

Sidhartha Roy PhD

The last decade has seen rapid growth in Machine Learning and Artificial Intelligence applications. As Andrew Ng rightly pointed out, AI is the new electricity, transforming almost every industry. The forces driving this growth are data and computing power, and both are expected to keep growing rapidly for the foreseeable future. Since AI models are only as good as the data they are trained on, access to more data means we can build better models. We clearly need more data, but we also need data that is sufficiently diverse and regularly updated.

Source: Dart Consulting [1]

But what’s the catch?

Every time you search something on the web, ask Alexa a question, use an app you downloaded from the internet, or even make a financial transaction, you are feeding companies with data. Data that might be personal.

People tell Google things they might not tell anybody else.
— Seth Stephens-Davidowitz (Author of Everybody Lies)
Data from different sources is aggregated at a central location, leading to major privacy concerns.

Our personal data is stored in large databases held by organizations, which then use it to make personalized predictions for us. For example, movie recommendations by Netflix, product recommendations by Amazon, and ad recommendations by Facebook all rely on our personal information. And let's be honest: we are all addicted to these personalized experiences. So, let's ask the question: should we trade our privacy for personalization?

We don’t actually see the downsides of feeding our personal data to companies until we read news headlines like this:

New York Times Article (2018)

While some applications use data to provide personalized recommendations, there are others, especially in the health sector, that use data to save lives. Healthcare organizations have shifted to digital record keeping and built their own data infrastructure. However, the accumulated data remains siloed within each organization [2], creating data islands.

Although bridging these data islands would drastically improve patient care, privacy concerns and ownership issues stand in the way. Privacy laws such as the GDPR [3] and the CCPA [4] make it practically impossible to pool such data in one place. Simply adding AI to these fragmented systems is not enough.

Photo by Yulia Agnis on Unsplash

The two scenarios discussed above are distinct. The first is a B2C, or business-to-consumer, setting: each individual owns their personal data and wants it used for a personalized experience, but not at the cost of a privacy breach. The second is a B2B, or business-to-business, setting: large organizations own data that is islanded away due to privacy concerns, yet they would like to use it to build better models. In a perfect world, we could bring all the data to a central location, build better AI systems with it, and trust that it is used responsibly.

The privacy-preserving way

However, there is another way of looking at this problem. Rather than the data traveling from various sources to a central location, we could let the machine learning model travel across locations.

Instead of the data traveling from various sources to a central location, we could let the machine learning model travel across locations.

Normally, data scientists collect and aggregate data in a central location and use it to train ML models. But since so much of the world's data is locked up in data islands, scientists and engineers have been developing solutions that do not depend on a central data source. This idea forms the basis of privacy-preserving machine learning systems, commonly known as federated learning (FL) or federated machine learning (FML).

In a federated setting, the machine learning model is trained locally at the source, which could be a data silo or an edge device containing private user data. The locally trained model is then sent to a central location, where it is aggregated into the central model.

In a federated learning setting, the data is secure at its original location. Only the model parameters are transferred across locations.

This type of ML training was famously performed by McMahan et al. [5] for updating language models in mobile phones at Google. Therefore, federated learning can be used to build ML models that let the data stay at its original location while some ML model information is exchanged between locations. The exchanged information does not actively reveal personal or sensitive information.
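The round-trip described above can be sketched in a few lines of NumPy. Everything here is a toy assumption for illustration: three simulated "data islands" each fit a simple linear model on their own private samples, and the server combines the returned parameters with a data-size-weighted average, in the spirit of the FedAvg algorithm by McMahan et al. [5]. Only the parameters `(w, b)` ever leave an island; the raw `(x, y)` data never does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: three "data islands", each holding private
# (x, y) samples drawn from the same relationship y = 2x + 1.
def make_island(n):
    x = rng.uniform(-1, 1, size=n)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=n)
    return x, y

islands = [make_island(n) for n in (50, 80, 120)]

def local_train(w, b, x, y, lr=0.1, epochs=20):
    """Gradient descent on one island's private data; only (w, b) leave."""
    for _ in range(epochs):
        err = (w * x + b) - y
        w -= lr * np.mean(err * x)
        b -= lr * np.mean(err)
    return w, b

# Federated rounds: the global model travels to each island, is trained
# locally, and a weighted average of the returned parameters updates
# the central model. No raw data crosses island boundaries.
w_global, b_global = 0.0, 0.0
sizes = np.array([len(x) for x, _ in islands], dtype=float)
for _ in range(10):
    local = [local_train(w_global, b_global, x, y) for x, y in islands]
    w_global = float(np.average([w for w, _ in local], weights=sizes))
    b_global = float(np.average([b for _, b in local], weights=sizes))

print(f"w ≈ {w_global:.2f}, b ≈ {b_global:.2f}")  # close to the true (2, 1)
```

A real deployment would of course use a deep model and secure aggregation on top of this loop, but the communication pattern is the same: parameters out, averaged parameters back.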

Advantages of using Federated Learning

There are several advantages to this type of training over the traditional approach. First, and probably most important, the data never leaves its original location. The only things communicated from the data source are the model parameters, so no one has to convince different organizations to share their data. Second, by not moving the data we reduce the communication burden: the cost of moving the model is orders of magnitude lower than moving the data itself. Third, by performing the training at the individual data locations we sidestep the challenges of normalizing and preprocessing data across sources. Data scientists will not have to worry about mapping data from different sources to a common format.
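The communication argument can be made concrete with back-of-the-envelope arithmetic. The sizes below are illustrative assumptions, not measurements: a model with roughly 25 million float32 parameters (ResNet-scale) against a hospital imaging dataset of 100,000 images at about 2 MB each.

```python
# Illustrative sizes only; real models and datasets vary widely.
model_params = 25_000_000        # ~ResNet-50-scale parameter count
bytes_per_param = 4              # float32
model_mb = model_params * bytes_per_param / 1e6

n_images = 100_000               # assumed dataset size
image_mb = 2                     # assumed size per image
dataset_mb = n_images * image_mb

print(f"model: {model_mb:.0f} MB, dataset: {dataset_mb:,} MB")
print(f"moving the model is ~{dataset_mb / model_mb:.0f}x cheaper per round")
```

Under these assumptions, shipping the model costs about 100 MB per round versus 200 GB to move the raw data, a gap of three orders of magnitude even before compressing the parameter updates.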

Concluding Remarks

Therefore, using federated learning, we can obtain a robust model trained on diverse datasets while maintaining privacy. The approach is becoming increasingly popular across industries. However, federated learning comes with its own set of challenges, such as adversarial attacks, data leakage, and model tampering, that data scientists need to be careful about.

Thanks to Skylar Alexander and Kartik Chopra for edits and comments.

  1. Dart Consulting: http://www.dartconsulting.co.in/market-news/artificial-intelligence-market-landscape-key-players-use-cases-ai-growth/
  2. Panch, T., Mattie, H. & Celi, L. A. The “inconvenient truth” about AI in healthcare. npj Digit. Med. 2, 77 (2019).
  3. GDPR: https://en.wikipedia.org/wiki/General_Data_Protection_Regulation
  4. CCPA: https://en.wikipedia.org/wiki/California_Consumer_Privacy_Act
  5. H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, Blaise Agüera y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data (2016).