An Overview of Federated Machine Learning

A complete guide to federated learning, including how it works, different variations, benefits, and common data scenarios.

Leslie Barthel

Many companies today struggle to gain insight from their data. According to IDC, 68% of all data collected by organizations goes unused. Traditional data science approaches require data to be consolidated into data stores before analysis or training of AI models. This duplication and movement of data has resulted in delays in realizing value, increased operational risk, and complexity. 

With data volumes exploding and Gartner predicting that by 2025, 75% of enterprise-generated data will be "created and processed outside a traditional centralized data center or cloud," treating centralized data management as a prerequisite for generating insight is increasingly untenable.

Further exacerbating the issue is growing legislation dictating how private data can be used, stored, and accessed. From the EU's GDPR to China's new data security laws, an estimated 75% of the global population will be covered by data privacy laws by 2024. This increased regulation adds legal, compliance, and privacy overhead for companies and makes it even harder to put this valuable resource to use.

Enter federated machine learning (FedML). First introduced in 2016, FedML flips the traditional workflow: instead of bringing the data to the model, it brings the model to the data, training algorithms where the data resides.

What is Federated Learning & How Does it Work?

FedML is a machine learning technique that trains models across multiple decentralized datasets without moving or exchanging data. The only centralized component is a server that orchestrates the model training process.

How does it work exactly? First, the central server sends the global model to each dataset, which could be anything from a CRM system to a SQL database to a legacy system. Then, the model trains on the local data and sends its updated parameters (weights) back to the central server. Finally, the server aggregates (or “averages”) the updated parameters into the global model, and the process repeats.
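The round described above can be sketched in a few lines of Python. This is a minimal illustration rather than any particular platform's implementation: the toy linear model, the helper names (`local_train`, `federated_average`, `run_rounds`), the learning rate, and the site data are all assumptions made for this example. Weighting each site by its sample count is a common choice for the averaging step.

```python
from typing import List, Tuple

Weights = List[float]                  # [slope, intercept] of a toy linear model
Dataset = List[Tuple[float, float]]    # (x, y) pairs that never leave their site


def local_train(weights: Weights, data: Dataset, lr: float = 0.01, epochs: int = 5) -> Weights:
    """Run a few epochs of gradient descent on one site's local data."""
    w, b = weights
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b]


def federated_average(updates: List[Weights], sizes: List[int]) -> Weights:
    """Average the site weights, weighting each site by its sample count."""
    total = sum(sizes)
    return [
        sum(u[i] * n for u, n in zip(updates, sizes)) / total
        for i in range(len(updates[0]))
    ]


def run_rounds(sites: List[Dataset], rounds: int = 200) -> Weights:
    """Server loop: send out the global model, collect updates, aggregate."""
    global_weights: Weights = [0.0, 0.0]
    for _ in range(rounds):
        # Only weights cross the site boundary; the raw (x, y) pairs do not.
        updates = [local_train(global_weights, data) for data in sites]
        global_weights = federated_average(updates, [len(d) for d in sites])
    return global_weights


# Hypothetical data: three sites observe y = 2x + 1 over different x ranges.
sites = [
    [(x, 2 * x + 1) for x in (0.0, 1.0, 2.0)],
    [(x, 2 * x + 1) for x in (3.0, 4.0)],
    [(x, 2 * x + 1) for x in (-2.0, -1.0, 0.5, 1.5)],
]
final_weights = run_rounds(sites)
```

In this sketch the server only ever sees the two model weights from each site, and the global model approaches a slope of 2 and an intercept of 1 even though no single site covers the full range of inputs.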

Since the model trains locally, on data where it resides, federated machine learning obviates the need to centralize data. In addition, because the data is distributed by nature, FedML is designed to train on non-IID (not independent and identically distributed) and unbalanced data.

Let's consider an example. One use case of federated machine learning is predictive maintenance. 

For instance, a manufacturing company might have 20 factory locations and want to reduce critical machinery downtime at each one. Typically, each site manages maintenance in a vacuum, developing manual rules to take machinery offline based on predefined timelines. However, this approach tends to cause two problems. First, machinery ends up being over-maintained to avoid costly or, in some cases, life-threatening downtime. Second, sites lack visibility into the more complex behavioral patterns that could indicate impending equipment failure or degradation.

With federated machine learning, they can analyze performance and maintenance data across all 20 locations without centralizing it. As a result, they’d potentially be able to recognize new patterns in the data (across a larger, 20-location dataset), predicting failure sooner and with greater accuracy and generalizability. This would allow the manufacturing company to optimize its maintenance schedule and more confidently prevent costly downtimes across all its locations.

Types of Federated Learning

Even though FedML has only been around since 2016, there are several different flavors of federated learning, depending on the data scenario and approach:

  • Centralized vs. Decentralized/Peer-to-Peer - How the training is orchestrated
  • Horizontal vs. Vertical - How datasets are partitioned
  • Cross-Silo vs. Cross-Device - What types of devices are involved

1. Centralized vs. Decentralized

Centralized federated learning—the most common approach—uses a central server to orchestrate the different steps of the model training and averaging across all local data sources (hub and spoke). Our previous factory scenario is an example of centralized federated learning. Each factory trains locally and then communicates the learnings to a centralized, global model.

On the other hand, decentralized (peer-to-peer) federated learning requires individual data sources to coordinate among themselves without a central orchestrator. In this case, model parameters are passed from one dataset to the next in a chain for training. For example, in the factory scenario, model training would happen consecutively at each location, starting at the first factory before moving on to the second, and so on. This approach avoids the centralized server but gives the datasets access to one another. As a result, it could potentially expose sensitive information or open up the model to poisoning if an untrusted party were to gain access.
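The chained hand-off can be sketched as a loop with no server at all. This is a simplified illustration under stated assumptions: the model is a single number that each site nudges toward the mean of its own readings, and the function names, learning rate, and factory data are hypothetical.

```python
from typing import List


def local_train(weight: float, data: List[float], lr: float = 0.1, epochs: int = 3) -> float:
    """Nudge a one-parameter model toward the mean of this site's local data."""
    for _ in range(epochs):
        grad = sum(2 * (weight - x) for x in data) / len(data)
        weight -= lr * grad
    return weight


def chain_training(sites: List[List[float]], passes: int = 50) -> float:
    """Pass the model from site to site in a fixed order, with no central server."""
    weight = 0.0
    for _ in range(passes):
        for data in sites:  # factory 1 trains, hands off to factory 2, and so on
            weight = local_train(weight, data)
    return weight


# Hypothetical sensor readings held at three factories.
factories = [[9.0, 11.0], [12.0], [13.0, 15.0]]
final_weight = chain_training(factories)
```

In this toy setup the result sits closer to the last factory's data than to the overall average, because the most recent site in the chain has the freshest influence. That order dependence is one design consideration when no server is available to average the updates.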

2. Horizontal vs. Vertical

Horizontal federated machine learning, also known as homogeneous or sample-based, is used when the datasets have the same schema. That means the datasets share the same features (columns) but consist of different samples (rows). For example, these could be sales databases for two separate retail locations. They each include a different list of customers but contain the same information about each customer, such as name, phone number, and purchase date. Consistency across datasets allows for more straightforward training because the datasets are constructed identically and only contain different inputs.

Vertical federated learning, also known as heterogeneous or feature-based, is applied to datasets containing differing feature sets. For example, this could mean combining a retail store's sales database with an online store's. They may include some of the same information or even overlapping customers; however, the online database may contain new features, such as email and street address. Linking these two databases together can be a challenge, especially if the unique identifier for the customer is different (phone number vs. email). Still, it may provide additional, valuable data to a model.
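The difference between the two partitioning styles is easy to see on toy records. The sketch below is illustrative only: the customer rows, the `same_schema` and `shared_entities` helpers, and the use of `phone` as a common key are all assumptions. When the identifiers differ (phone number vs. email), an extra mapping step would be needed before the overlap could be found.

```python
from typing import Dict, List

Row = Dict[str, str]

# Horizontal: two stores share a schema but hold different customers (rows).
store_a: List[Row] = [
    {"phone": "555-0101", "name": "Ana", "purchase_date": "2022-03-01"},
    {"phone": "555-0102", "name": "Ben", "purchase_date": "2022-03-04"},
]
store_b: List[Row] = [
    {"phone": "555-0199", "name": "Cy", "purchase_date": "2022-03-02"},
]

# Vertical: retail and online systems hold different features (columns)
# about a partially overlapping set of customers.
retail: List[Row] = [
    {"phone": "555-0101", "purchase_date": "2022-03-01"},
    {"phone": "555-0102", "purchase_date": "2022-03-04"},
]
online: List[Row] = [
    {"phone": "555-0101", "email": "ana@example.com", "street": "1 Main St"},
    {"phone": "555-0300", "email": "dee@example.com", "street": "9 Oak Ave"},
]


def same_schema(left: List[Row], right: List[Row]) -> bool:
    """True when every row on both sides has the same set of columns."""
    return len({frozenset(row) for row in left + right}) == 1


def shared_entities(left: List[Row], right: List[Row], key: str) -> List[str]:
    """Key values present on both sides; only these rows can be joined by feature."""
    return sorted({row[key] for row in left} & {row[key] for row in right})


horizontal_ok = same_schema(store_a, store_b)
vertical_overlap = shared_entities(retail, online, "phone")
```

Here the two stores pass the schema check and could train the same model on their own rows, while the retail and online systems fail it and can only contribute complementary features for the customers they have in common.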

3. Cross-silo vs. Cross-device

Cross-silo federated learning trains models on data that is separated by organizational, regulatory, or functional barriers. In such cases, data is often stored on larger computing systems, such as cloud instances or bare metal servers. As a result, there tends to be a relatively small number of silos or training sets.

Cross-device federated learning occurs when models train at the edge on the actual IoT devices, such as cell phones, drones, or Raspberry Pi-type systems. In this case, the federation typically needs a very large number of devices, often millions, to work. This approach is limited by the low computing power of the individual devices and the increased potential for devices to be offline and thus unavailable to participate in the training process. It is also restricted to small, homogeneous datasets, which makes it a poor fit for enterprise environments.
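One way to picture the availability problem is a round in which the server invites only a sample of devices and averages whatever comes back. The sketch below is a simplified illustration: the device IDs, the sampling fraction, the `select_devices`, `collect_updates`, and `aggregate` helpers, and the single-number "model" are all hypothetical.

```python
import random
from typing import Dict, List, Set


def select_devices(device_ids: List[str], fraction: float, rng: random.Random) -> List[str]:
    """Invite a random fraction of the registered devices to this round."""
    count = max(1, int(len(device_ids) * fraction))
    return rng.sample(device_ids, count)


def collect_updates(selected: List[str], local_updates: Dict[str, float], online: Set[str]) -> List[float]:
    """Keep only the updates from invited devices that are currently reachable."""
    return [local_updates[d] for d in selected if d in online]


def aggregate(global_value: float, updates: List[float]) -> float:
    """Average the updates that arrived; keep the old model if none did."""
    if not updates:
        return global_value
    return sum(updates) / len(updates)


# Hypothetical fleet of eight devices, three of which are offline this round.
devices = [f"device-{i}" for i in range(8)]
local_updates = {d: float(i) for i, d in enumerate(devices)}
online = {"device-0", "device-2", "device-3", "device-5", "device-7"}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
selected = select_devices(devices, fraction=0.25, rng=rng)
new_global = aggregate(0.0, collect_updates(selected, local_updates, online))
```

Because any invited device may be offline, the aggregation step has to tolerate partial or even empty rounds, which is why `aggregate` falls back to the previous model when nothing arrives.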

The Benefits of Federated Machine Learning

The adoption of FedML is accelerating rapidly due to its many advantages. In fact, Gartner predicts that by 2024, 80% of the largest global organizations will have participated at least once in FedML to create more accurate, secure, and environmentally sustainable models.

1. Improved Model Accuracy & Generalizability

FedML enables collaborative learning and model improvements by training on local datasets and continuously incorporating those learnings into a centralized model. This continuous feedback loop results in a more accurate, generalizable model vs. training only on local datasets. 

For example, the U.S. National Institutes of Health used FedML to create a model that could better predict future oxygen requirements of symptomatic COVID-19 patients. Training on data across 20 different facilities, the FedML model showed a 16% improvement across all participating sites and an average increase of 38% in generalization vs. models trained only on one site's data.

In addition, by leaving the data where it resides, FedML can unlock previously restricted datasets for analysis, including those containing proprietary or personal information. In situations where training data is hard to come by, FedML can bridge the gap, resulting in better models. 

2. Increased Privacy & Security

FedML provides several privacy and security advantages. First, the federated averaging process keeps the data in local systems, training on the private data where it resides. When a training round is complete, only the updated model parameters are sent back to the global model, never the raw data. As a result, FedML decreases the potential risk of data leakage, misuse, or exposure of sensitive information. In addition, leaving the data in place helps with compliance with regulations such as HIPAA, GDPR, and CCPA.

The traditional approach to machine learning involves duplicating and centralizing data into one place. However, having all your data, including private data, in a single repository is a significant risk for an organization. If a bad actor gained access to that centralized database, it would represent a significant data breach, damaging brand reputation and potentially costing millions of dollars. FedML mitigates this risk by keeping data decentralized, which significantly reduces the potential attack surface.

In addition, moving data can result in serious data lineage issues, making it difficult to confirm whether the data is accurate or came from a trusted source. If lineage is not tracked appropriately, resolving these issues can be very time-consuming and costly. FedML avoids this problem by not moving the data in the first place, enabling data owners to retain control at the source and eliminating data lineage challenges caused by centralization.

3. Boosted Speed & Efficiency

FedML achieves high computation efficiency, accelerating the deployment and testing of models while reducing latency. Since FedML doesn't require centralization, the lag time for model updating is diminished, making near real-time predictions possible. This is especially valuable for time-sensitive applications, such as self-driving cars and countering the financing of terrorism (CFT). 

FedML also copes with system heterogeneity, where local devices have unbalanced resources (e.g., computation, communication, storage, and energy). Local model training helps reduce bandwidth and energy consumption, and asynchronous training cadences keep a resource-constrained dataset from blocking the others.

There is a green benefit to FedML as well. Duplicating and moving large datasets in and out of the cloud can consume a lot of energy and result in a large carbon footprint. By employing FedML, organizations can lower their data storage and movement overhead by only moving the model and not the data—significantly reducing their environmental impact.  

Federated Machine Learning Scenarios with Devron

There are many different applications and use cases for the Devron FedML platform. However, in our experience, they all ultimately boil down to three common data scenarios: a need to access private or immovable data, a need to analyze distributed datasets, or a desire to monetize or share data without exposing the raw information. 

Private Datasets

In the first scenario, a company may desire to access a restricted dataset. It could be off-limits because it contains PII or PHI, or maybe it is out of reach due to impending M&A activity or regulatory restrictions, like GDPR. It could also include the financials of a publicly traded company, which require an arduous compliance process before the data can be moved. Whatever the reason, in this scenario, data science teams have a dataset they want to access and analyze but cannot.

Devron enables organizations to unlock access to previously inaccessible datasets such as these without moving or exposing any of the raw data. As a result, data science teams no longer need to mask the data or negotiate over features and columns. Instead, they can build and train models based on the raw data in situ. In addition, Devron employs privacy-enhancing technologies to ensure model learnings can never be reverse-engineered to reveal the source information. 

Multiple Datasets

The second scenario is when a company has multiple datasets it cannot or simply does not want to move. For instance, it may not be willing to take on the increased cyber risk of centralization, or it may want to avoid the additional infrastructure cost of duplicating and storing data twice. Or it may simply not want to deal with the data engineering headache, lengthy ETL pipelines, data lineage, or privacy leakage issues. Further, it may be unable to move the data because of data sovereignty limitations, because the data is sensitive enough that putting it in motion creates risk, or because the data volume is impractical for the available bandwidth.

Whatever the reason, Devron can solve this problem and alleviate the pain of duplication and data movement. This is especially relevant in the Professional Services and Consulting industries, where firms don't want to take on the liability of duplicating and moving their clients' data even if they can get access.

Monetization & Sharing

In the final scenario, a company wants to monetize its data. It may have datasets that other companies want to pay to access and analyze. However, as the data owner, it doesn't want to expose the raw information. This is particularly common in the financial services and healthcare industries.

Devron can enable this scenario by allowing third parties to analyze and build models against a dataset they do not own and for which they cannot see the raw data. Forged in the most data-sensitive environments of the Defense and Intelligence communities, Devron is designed to meet the strictest regulatory and compliance requirements. The platform generates synthetic data during the pre-processing stage, giving the data scientist visibility into the dataset's schema and “feel of the data” while never exposing the source information. Unlike other solutions, Devron uses synthetic data for pre-processing only.

Once the training begins, the model trains directly on the raw data itself, thus improving model accuracy. The possible outcomes are wide-ranging and could include anything from new revenue-generating data products to better financial benchmarks to new medical discoveries.

Devron’s Differentiated Federated Machine Learning Platform 

Devron is the first privacy-preserving FedML platform to enable vertical learning using a codeless UI

In a perfect world, every system would use the same taxonomy and collect the same fields. However, that's not realistic, and disparate datasets contain different schemas and potentially different unique identifiers. Using a proprietary approach, we enable data scientists to easily link data sources together and map relationships between various features. As a result, Devron customers can perform advanced analytics on numerous and widely diverse, disparate datasets—realizing models with greater accuracy, precision, and generalizability. 

Devron is a platform-agnostic solution that can be deployed inside a company's existing cloud environment and access data where it resides, whether in AWS, Snowflake, Microsoft Azure, Salesforce, or elsewhere. Data science teams have the option to build and train models using our codeless UI or a Jupyter notebook. Offering both options makes our solution accessible to a broader audience, including data science and machine learning experts, analysts, and business leadership.

To learn more about the Devron platform and how it can enable your organization to energize its data science initiatives and analyze disparate, private, and/or heterogeneous datasets, request a demo today.