Decentralization: The New Paradigm in Data Science

An evolving data science paradigm enables faster experimentation, lowers risk, requires less overhead, and delivers consistent value.

Leslie Barthel

The idea of artificial intelligence dates back to 1950, when Alan Turing first asked whether machines could think. Now, more than 70 years later, AI touches many aspects of our everyday lives.

However, according to Gartner, over 85% of AI and machine learning projects fail. There are many reasons behind this failure rate, but ultimately it boils down to the data. 

Data is the fuel that powers AI. In the last decade, we've seen exponential year-over-year growth in the amount of data worldwide, with no signs of slowing down. Unfortunately, as data increases, so does the chasm between how much exists and how much gets analyzed. In fact, according to Forrester, 60-73% of all enterprise data goes unused for analytics purposes. Much of that is due to the unstructured and decentralized nature of data today. Valuable enterprise data spans different silos, systems, and jurisdictions, making it nearly impossible to derive value using the typical centralized approach to AI and data science projects.

Centralized Approach: CRISP-DM & its Challenges


Developed in the late 1990s by an industry consortium that included SPSS (now part of IBM), the CRoss-Industry Standard Process for Data Mining (CRISP-DM) serves as a standard model for the data science lifecycle. It consists of six stages and outlines the typical tasks and the relationships between them.

  1. Business Understanding - Set objectives and requirements for the project.
  2. Data Understanding - Identify, collect, and analyze data to achieve project goals.
  3. Data Preparation - Select, clean, construct, integrate, and format the datasets in preparation for modeling.
  4. Modeling - Build and assess models. 
  5. Evaluation - Evaluate model results and determine the next steps.
  6. Deployment - Deploy the model. 

Despite being the industry standard for over 25 years, CRISP-DM has a few flaws. 

First, the model's symmetry implies that an equal amount of time and emphasis should be given to each step. However, this is far from the truth. As outlined in a previous article, understanding the business problem and thoroughly analyzing the data available to solve that problem are essential to the success of any project. By focusing more on the first two steps, data scientists can lay a strong foundation for the effort and ensure better results in subsequent phases.

In addition, CRISP-DM depicts data at the center of the model because, traditionally, most data science required data to be centralized and combined before any value could be gained from it. However, data rarely lives in one place. As a result, data scientists have to go through a lengthy process to acquire access to many disparate data sources, often waiting weeks or even months just for approvals. And once they gain access, they can't simply copy the data from one place to another, because many datasets contain personally identifiable information (PII) that is subject to different privacy, regulatory, and compliance constraints depending on the nature of the data and where it lives.

Suddenly, the project requires cross-border data migration, redaction, anonymization, and synthetic data generation, destroying potentially valuable raw signals in the process. On top of that comes the ETL overhead of extracting, moving, cleaning, and transforming different datasets.

AI projects will continue to underperform if we depend only on the data we can copy, move, warehouse, or gain approval to analyze. This evolutionary challenge necessitates a new approach.

Introducing the New, Decentralized Paradigm

[Figure: Decentralized CRISP-DM model]

The new and improved approach accounts for the distributed nature of data today. Instead of bringing the data to the model, it brings the model to the data, training algorithms where the data resides. Because data no longer has to move across borders, it also avoids certain regulatory limitations.

As a result, data science teams no longer need to waste time, effort, or money centralizing data, significantly reducing ETL overhead. They also no longer have to wait to gain access to the data or worry about anonymizing or redacting sensitive information. 

Instead, they can train algorithms without exposing the underlying data sources, and the organization becomes more agile through rapid experimentation.

Analyze Data Where it Resides with Complete Privacy

[Figure: How Devron works]

Devron makes this decentralized data science paradigm a reality, allowing data scientists to train models at the point of data collection. 

Devron's federated machine learning platform consists of two components: the Control Center and one or more Satellites. The Control Center governs the federated cohort of datasets. Each dataset is essentially a “Satellite” and can be anything from a file in public cloud storage to a SQL database to a legacy system.

The Control Center allows data scientists to configure Satellites, develop models, and run experiments. Typically, several Satellites connect to a single Control Center, and each trains on its local data and sends back its local results. The Control Center then aggregates those results and provides the Satellites with an updated model for continued training.
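The train-locally, aggregate-centrally loop described above can be illustrated with a small sketch. It is not Devron's API or implementation. It assumes a simple weighted-average aggregation (commonly called federated averaging) over a toy linear model, and the names `local_train`, `aggregate`, and `run_federated` are hypothetical.

```python
from typing import List, Tuple

Weights = Tuple[float, float]          # (slope, intercept) of y = w*x + b
Dataset = List[Tuple[float, float]]    # (x, y) pairs that stay on one satellite


def local_train(weights: Weights, data: Dataset,
                lr: float = 0.1, epochs: int = 10) -> Tuple[Weights, int]:
    """Hypothetical satellite step: gradient descent on local data only."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    # Only the updated weights and the sample count leave the satellite.
    return (w, b), n


def aggregate(results: List[Tuple[Weights, int]]) -> Weights:
    """Hypothetical control-center step: average weights by sample count."""
    total = sum(n for _, n in results)
    w = sum(wt[0] * n for wt, n in results) / total
    b = sum(wt[1] * n for wt, n in results) / total
    return (w, b)


def run_federated(satellites: List[Dataset], rounds: int = 10) -> Weights:
    """Alternate local training and central aggregation for several rounds."""
    global_weights: Weights = (0.0, 0.0)
    for _ in range(rounds):
        results = [local_train(global_weights, data) for data in satellites]
        global_weights = aggregate(results)
    return global_weights


# Two toy satellites whose private points all lie on y = 2x + 1.
satellites = [
    [(-1.0, -1.0), (0.0, 1.0), (1.0, 3.0)],
    [(-2.0, -3.0), (-1.0, -1.0), (1.0, 3.0), (2.0, 5.0)],
]
w, b = run_federated(satellites)
print(f"slope={w:.3f} intercept={b:.3f}")
```

In this toy setup the learned slope and intercept should come out close to 2 and 1, even though neither dataset is ever copied to the aggregator. A production system would add authentication, transport, and the privacy protections discussed next.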

Devron brings analytics to the data instead of data to the analytics. Raw data never transfers from the Satellites to the Control Center; only model metadata does. The platform also uses privacy-enhancing technologies, such as secure multiparty computation, differential privacy, and encryption, to prevent this metadata from being reverse-engineered.
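To give a feel for one of these techniques, here is a minimal sketch of how a model update might be protected in the style of differential privacy before it leaves a Satellite: the update is clipped to a maximum norm and Gaussian noise is added. The source does not describe how Devron applies differential privacy, so the function names (`clip_update`, `privatize_update`) and parameters (`clip_norm`, `noise_multiplier`) are assumptions for illustration, and the sketch does not compute a formal privacy budget.

```python
import math
import random
from typing import List, Optional


def clip_update(update: List[float], clip_norm: float) -> List[float]:
    """Scale the update down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    if norm <= clip_norm:
        return list(update)
    scale = clip_norm / norm
    return [v * scale for v in update]


def privatize_update(update: List[float],
                     clip_norm: float = 1.0,
                     noise_multiplier: float = 1.0,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Hypothetical satellite-side step: clip, then add Gaussian noise.

    The noise standard deviation is noise_multiplier * clip_norm, so the
    noise is calibrated to the largest influence one update can have.
    """
    if rng is None:
        rng = random.Random()
    clipped = clip_update(update, clip_norm)
    sigma = noise_multiplier * clip_norm
    return [v + rng.gauss(0.0, sigma) for v in clipped]


# A raw update with L2 norm 5 is clipped to norm 1 and then perturbed.
raw_update = [3.0, 4.0]
noisy_update = privatize_update(raw_update, rng=random.Random(0))
print(noisy_update)
```

Clipping bounds how much any single update can reveal, and the added noise masks what remains. The trade-off is that more noise gives stronger protection but a less accurate aggregated model.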

As a result, data stays put, data lineage is preserved, source data remains private, and data science teams obtain faster access and approvals. 

To learn more about this new decentralized paradigm of model training and how Devron is deploying this approach, watch the recording of our webinar, Accelerating AI Business Value