Artificial intelligence dates back to 1950, when Alan Turing first asked whether machines could think. Now, more than 70 years later, AI touches many aspects of our everyday lives.
Data is the fuel that powers AI. In the last decade, we've seen exponential year-over-year growth in the amount of data worldwide, with no signs of slowing down. Unfortunately, as data increases, so does the chasm between how much exists and how much gets analyzed. In fact, according to Forrester, 60 to 73% of all enterprise data goes unused for analytics purposes. Much of that is due to the unstructured and decentralized nature of data today. Valuable enterprise data spans different data silos, systems, and jurisdictions, making it nearly impossible to derive value using the typical centralized approach to AI and data science projects.
Initially developed in the late 1990s by an industry consortium that included SPSS (later acquired by IBM), the Cross-Industry Standard Process for Data Mining (CRISP-DM) serves as a standard model for the data science lifecycle. It consists of six stages and outlines the typical tasks and the relationships between them.
Despite being the industry standard for over 25 years, CRISP-DM has a few flaws.
First, the model's symmetry implies that an equal amount of time and emphasis should be given to each step. However, this is far from the truth. As outlined in a previous article, understanding the business problem and thoroughly analyzing the data available to solve that problem is essential to the success of any project. By focusing more on the first two steps, data scientists can lay a strong foundation for the effort and ensure better results in subsequent phases.
In addition, CRISP-DM depicts data at the center of the model because traditionally, most data science required data to be centralized and combined to gain value from it. However, as we know, data is not simply in one place. As a result, data scientists have to go through a lengthy process to acquire access to many disparate data sources—often waiting weeks or even months just for approvals. And once they gain access, they can't simply copy it from one place to another. That's because many datasets contain personally identifiable information (PII), subject to different privacy, regulatory, and compliance constraints depending on the nature of the data and where it lives.
Suddenly, the project requires cross-border data migration, redaction, anonymization, and synthetic generation—destroying potentially valuable raw signals in the process. Not to mention the ETL overhead of extracting, moving, cleaning, and transforming different datasets.
AI projects will continue to underperform if we depend only on the data we can copy, move, warehouse, or gain approval to analyze. This evolutionary challenge necessitates a new approach.
The new and improved approach accounts for the distributed nature of data today. Instead of bringing the data to the model, it brings the model to the data, training algorithms where the data resides. It also avoids certain regulatory hurdles by eliminating the need to move data across borders.
As a result, data science teams no longer need to waste time, effort, or money centralizing data, significantly reducing ETL overhead. They also no longer have to wait to gain access to the data or worry about anonymizing or redacting sensitive information.
Instead, they can train algorithms without exposing the underlying data sources and become more agile as an organization through rapid experimentation.
Devron makes this decentralized data science paradigm a reality, allowing data scientists to train models at the point of data collection.
Devron's federated machine learning platform consists of two components: Central Authority and Satellite(s). The control center (or “Central Authority”) governs the federated cohort of datasets. Each dataset is essentially a “Satellite” and can be anything from a file in public cloud storage to a SQL database to a legacy system.
The Central Authority allows data scientists to configure Satellites, develop models, and run experiments. Typically, several Satellites connect to a single Central Authority, and each trains locally and sends back its results. The Central Authority then aggregates those results and provides the Satellites with an updated model for continued training.
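The train-locally, aggregate-centrally loop described above can be illustrated with a minimal sketch of federated averaging. This is not Devron's implementation or API. It assumes a toy linear model, synthetic data, and a sample-weighted average as the aggregation rule, and every function name here (`local_train`, `aggregate`, `run_federated`, `make_satellite_data`) is hypothetical.

```python
import random


def local_train(weights, data, lr=0.05, epochs=5):
    """Satellite step: a few epochs of gradient descent on private data.

    `data` is a list of (x, y) pairs and the model is y = w * x + b.
    Only the updated (w, b) and the sample count are returned;
    the raw rows never leave this function.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in data) / n
        grad_b = sum((w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b), n


def aggregate(results):
    """Central Authority step: sample-weighted average of Satellite models."""
    total = sum(n for _, n in results)
    w = sum(model[0] * n for model, n in results) / total
    b = sum(model[1] * n for model, n in results) / total
    return (w, b)


def make_satellite_data(n, seed):
    """Synthetic stand-in for a private dataset: y = 2x + 1 plus small noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 2 * x + 1 + rng.gauss(0, 0.01)) for x in xs]


def run_federated(satellites, rounds=200):
    """Alternate local training and central aggregation for a fixed number of rounds."""
    global_model = (0.0, 0.0)
    for _ in range(rounds):
        results = [local_train(global_model, data) for data in satellites]
        global_model = aggregate(results)
    return global_model


# Three Satellites with different amounts of (assumed) local data.
satellites = [make_satellite_data(n, seed) for n, seed in [(50, 1), (80, 2), (30, 3)]]
model = run_federated(satellites)
print("learned w, b:", model)
```

In this sketch the Central Authority only ever sees model parameters and sample counts, which mirrors the idea of exchanging model metadata rather than raw data.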
Devron brings analytics to the data instead of data to the analytics. Raw data never transfers from the Satellites to the Central Authority; only model metadata does. And the platform leverages Privacy Enhancing Technologies, like secure multiparty computation, differential privacy, and encryption, to prevent this metadata from being reverse-engineered.
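To give a feel for how model metadata can be protected in transit, here is a simplified sketch of additive masking, one building block used in secure aggregation schemes. It is an illustration under stated assumptions, not a description of how Devron's platform works: the masks come from a single seeded generator rather than pairwise key agreement, the generator is not cryptographically secure, and all names (`pairwise_masks`, `mask_update`, `aggregate_masked`) are hypothetical.

```python
import random

MODULUS = 2**61 - 1  # all arithmetic is done modulo this value
SCALE = 10**6        # fixed-point scale used to encode floats as integers


def encode(value):
    """Encode a float as a fixed-point integer modulo MODULUS."""
    return int(round(value * SCALE)) % MODULUS


def decode(value):
    """Map a residue back to a signed float."""
    if value > MODULUS // 2:
        value -= MODULUS
    return value / SCALE


def pairwise_masks(num_parties, seed=0):
    """Return one net mask per party; the masks sum to 0 modulo MODULUS.

    Each pair (i, j) shares a random value that party i adds and party j
    subtracts. A real protocol would derive it via key agreement; a seeded
    generator stands in here for illustration only.
    """
    rng = random.Random(seed)
    masks = [0] * num_parties
    for i in range(num_parties):
        for j in range(i + 1, num_parties):
            shared = rng.randrange(MODULUS)
            masks[i] = (masks[i] + shared) % MODULUS
            masks[j] = (masks[j] - shared) % MODULUS
    return masks


def mask_update(update, mask):
    """Satellite side: hide the encoded update behind the party's net mask."""
    return (encode(update) + mask) % MODULUS


def aggregate_masked(masked_updates):
    """Central side: the masks cancel in the sum, revealing only the total."""
    return decode(sum(masked_updates) % MODULUS)


updates = [0.25, -1.5, 3.0]  # one scalar update per Satellite (made-up values)
masks = pairwise_masks(len(updates))
masked = [mask_update(u, m) for u, m in zip(updates, masks)]
total = aggregate_masked(masked)
print("sum of updates:", total)
```

Each masked value on its own looks like a random number, yet the aggregate is recovered exactly because the pairwise masks cancel when summed.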
As a result, data stays put, data lineage is preserved, source data remains private, and data science teams obtain faster access and approvals.
To learn more about this new decentralized paradigm of model training and how Devron is deploying this approach, watch the recording of our webinar, Accelerating AI Business Value.