The New & Improved Data Science Workflow


Learn about the new data science workflow that's simplifying your data science to-do list and accelerating your time to insight.

Kai Wombacher

Today, data science plays a part in virtually every enterprise and across every industry, with a single purpose: solving problems using data. It's used for everything from predicting customer churn and detecting cancer in patient images to preventing money laundering and enabling predictive maintenance.

However, the data required to solve these problems is often distributed across different systems, departments, jurisdictions, and privacy tiers. Gaining access to the data for model training is a long and expensive process that frequently causes data science projects to run over time, over budget, and in many cases, fail altogether.

In this post, we'll take a closer look at the traditional data science workflow and identify the inefficiencies and failure points in the process. We'll also illustrate how Devron's federated machine learning platform mitigates or avoids these costs and difficulties by skipping data movement altogether. As a result, Devron empowers data scientists to unlock new data and improve their models' performance and generalizability without altering their existing workflow—only streamlining it.

The Traditional Data Science Workflow

The workflow below reflects a data scientist's typical steps for building a machine learning model. It consists of three phases: (1) exploratory data analysis, (2) data extraction, transformation, and loading, and (3) model training. The time and difficulty of each phase vary significantly from project to project and depend largely on the complexity of the data.

Before you start the initial phase of exploratory data analysis (EDA), you must first define the problem you're trying to solve. It will directly affect the type of data you need, as well as the complexity of your project. Framing the question correctly to address the business need is the foundation on which the entire project is built.

1. Exploratory Data Analysis

After clearly defining the problem, you must determine whether you have the data available to deliver the desired business outcome. The EDA phase is just what it sounds like—it centers around finding, gaining access to, and exploring your available data.

For instance, if a telecommunications company is trying to predict the probability of customers churning (canceling their subscriptions), they must ensure they have historical data on churned customers. This way, their model can learn the difference between customers who cancel versus keep their subscriptions.

Sometimes (but not always), it’s easy to identify the necessary data, including if/where it resides. However, justifying the requirement and gaining approval for access from the data owners can be tricky, especially if you must navigate the complicated legal, privacy, and compliance triad.

Traditionally, you'll go through a data approvals process to obtain permission to access, duplicate, and move data for analysis purposes. This data may be within the same organization but owned by another department, or it may be data that is entirely external to the organization and owned by a partner. In either case, gaining access to that data may require legal agreements and potentially additional privacy-preserving steps, like masking, hashing, or encryption, to protect the nature of the sensitive data. As a result, the data approvals process can be lengthy, lasting days, weeks, or sometimes even months. And it can cost thousands or even tens of thousands of dollars in legal fees.

Once you finally obtain access, you'll explore the data's validity, cleanliness, and consistency. An effective EDA includes understanding the data's schema, running queries, generating analyses, and creating visualizations to understand its key attributes, such as its distribution, correlations, and structure. This information helps you understand your data and how it must be transformed before you can use it to train your model.
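In practice, much of this exploration happens in a few lines of pandas. The sketch below uses a small, made-up churn dataset (the column names are illustrative, not from any real telecom dataset) to show the kind of schema, completeness, and correlation checks described above:

```python
import pandas as pd

# Hypothetical churn dataset, for illustration only
df = pd.DataFrame({
    "customerID": [1, 2, 3, 4, 5],
    "tenure_months": [2, 24, 36, 1, 60],
    "monthly_charges": [70.0, 55.5, 20.0, 89.9, 45.0],
    "churned": [1, 0, 0, 1, 0],
})

# Schema and completeness checks
print(df.dtypes)
print(df.isna().sum())

# Key attributes: distribution and correlations
print(df["monthly_charges"].describe())
print(df[["tenure_months", "monthly_charges", "churned"]].corr())
```

The same checks scale to real datasets; the point is simply that EDA is cheap once you have access, and the access itself is the bottleneck.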

2. Data Extraction, Transformation & Loading

Traditional data science requires data to be centralized in one place before you can gain value or insights. In the data extraction, transformation, and loading (ETL) phase, you do the bulk of the centralization work—deduplicating, moving, and unifying the data.

First, data is pulled or extracted from its source, such as CSV files, MS SQL, Snowflake, etc. The information is then transformed or cleaned. Most data will require some transformation, like removing incorrect, corrupt, misconfigured, duplicate, or incomplete data.
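For a concrete (and deliberately tiny) example of the transformation step, here is how duplicate and incomplete records might be dropped with pandas; the records themselves are invented for illustration:

```python
import pandas as pd

# Extracted records containing a duplicate row and an incomplete one
raw = pd.DataFrame({
    "customerID": [1, 1, 2, 3],
    "amount": [10.0, 10.0, None, 25.0],
})

cleaned = (
    raw.drop_duplicates()          # remove exact duplicate rows
       .dropna(subset=["amount"])  # drop records missing an amount
       .reset_index(drop=True)
)
```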

Once the data is prepared, the different datasets are combined to create a single, consistent, and cohesive dataset. To do so, you'll configure the 'joins' between the data sets by identifying the join type (inner, outer, left, right), selecting the foreign keys, and matching algorithm (exact match or fuzzy match). For example, you could perform an inner join on a "customerID" in your transactions and customer demographics datasets. This join would allow you to build models using customer demographic data and transaction history. Once unified, the dataset is loaded into relevant storage, such as MS SQL, for easy access and use.
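The "customerID" inner join described above looks like this in pandas (with invented rows for illustration). Note that, as with any inner join, customers appearing in only one of the two datasets are dropped from the unified result:

```python
import pandas as pd

transactions = pd.DataFrame({
    "customerID": [1, 2, 2, 4],
    "amount": [30.0, 12.5, 8.0, 99.0],
})
demographics = pd.DataFrame({
    "customerID": [1, 2, 3],
    "zip_code": ["10001", "94105", "60601"],
})

# Inner join on the shared foreign key: only customers present in
# both datasets survive (customer 4 has no demographics, customer 3
# has no transactions, so both are excluded)
unified = transactions.merge(demographics, on="customerID", how="inner")
```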

Unfortunately, data today is distributed across different data silos, departments, systems, and regulatory and privacy jurisdictions, making the ETL process challenging and time-consuming. According to a survey of 2,300 data scientists, “45% of their time is spent getting data ready (loading and cleansing) before they can use it to develop models and visualizations.” The more time scientists spend preparing their data, the less time they have to build or refine their models and extract more insights from their data.

Finally, extracting data from one system and loading it into another comes with enormous security and data governance restrictions. Data owners will want to ensure the system the data will be loaded into is secure and that processes are in place to govern who has access—adding additional time and complexity to the project.

3. Model Training

Model training is the third and final phase of the traditional data science workflow. This is where the fun really starts. Model training begins with feature engineering (a.k.a. data-wrangling) to make the data machine readable (e.g., one-hot encoding, scaling).
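The one-hot encoding and scaling mentioned above can be sketched in a few lines; the plan names and charge values here are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "plan": ["basic", "premium", "basic"],
    "monthly_charges": [20.0, 80.0, 50.0],
})

# One-hot encode the categorical column: "plan" becomes
# "plan_basic" and "plan_premium" indicator columns
encoded = pd.get_dummies(df, columns=["plan"])

# Min-max scale the numeric column into [0, 1] so features with
# large raw magnitudes don't dominate training
col = encoded["monthly_charges"]
encoded["monthly_charges"] = (col - col.min()) / (col.max() - col.min())
```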

Next, you select your target variable and feature set. For instance, if you are trying to predict when customers will churn, you can use a "Churn" column (boolean) as your target variable and use columns such as "number of transactions," "zip code," or "family size" as features to predict "Churn."

Finally, once the training data is set, you must choose the optimal machine learning algorithm and training parameters. For example, if you have a small dataset, you wouldn't select a neural network algorithm as they (typically) require large amounts of data. However, model training is an iterative process, and you may need to test multiple algorithms with different training parameters before you find the right one for your particular business problem.
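That iterative algorithm search often looks like a loop over candidate models. The sketch below assumes scikit-learn is available and uses a tiny, clearly separable toy dataset; in a real project you would compare models on held-out data rather than training accuracy:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Toy churn features: [number_of_transactions, family_size]
X = [[1, 1], [2, 1], [3, 2], [10, 3], [12, 4], [15, 2]]
y = [1, 1, 1, 0, 0, 0]  # 1 = churned

# Candidate algorithms with different training parameters
candidates = {
    "logistic_regression": LogisticRegression(),
    "decision_tree": DecisionTreeClassifier(max_depth=2, random_state=0),
}

# Fit each candidate and record its (training) accuracy
scores = {name: model.fit(X, y).score(X, y)
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
```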

Once your model is trained, you can evaluate its performance on hold-out data (data not used in model training). A model's performance is assessed based on several metrics, including accuracy, precision/recall, and F1-Score. Evaluating the model can vary depending on your specific use case and data. For example, if you are trying to train a model to predict cancer, you would want to ensure that the number of false negatives your model produces is as close to zero as possible. This would prevent you from telling a patient they don't have cancer when they actually do.
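The metrics above fall straight out of the confusion matrix. This toy evaluation (with invented predictions) shows the arithmetic, and why recall is the metric to watch in the cancer example, since false negatives are missed cancers:

```python
# Hold-out labels and predictions; 1 = cancer, 0 = healthy (toy data)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # missed cancers
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)  # low recall means patients told they are healthy when they are not
f1 = 2 * precision * recall / (precision + recall)
```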

The ideal model is rarely produced after the first round of training. Instead, model training is an iterative process that repeats until you're satisfied that the model will solve your business problem. The length and difficulty of this phase vary greatly depending on the problem's complexity, your data's quality, and model selection.

The New & Improved Data Science Workflow

Devron has created an end-to-end data science platform that leverages federated machine learning and a series of privacy-enhancing computation techniques to solve the data movement and privacy issues inherent to the traditional data science process. 

It takes into account the decentralized nature of data, and instead of bringing the data to the model, it brings the model to the data, transcending many regulatory restrictions that limit access to data and significantly reducing organizational risk and ETL overhead.  

The Devron platform architecture consists of a control center that orchestrates the distributed model training, which occurs locally on satellite machines (compute that is proximate to the datasets) with read-only data access. The model trains locally on each dataset and then aggregates the learnings into a single global model. The diagram above illustrates this relationship between the control center, satellites, and datasets.

So, how does the Devron platform transform the data science workflow?

1. Explore More Data in Less Time

First, Devron significantly simplifies the data approvals process. By leaving the data where it is and keeping raw values private, authorization to access, copy, and move data is streamlined. Instead, you only need the approval to bring your model to the data, granting it read-only access. 

This means data no longer needs to move across departments, organizations, or jurisdictional borders to be analyzed, accelerating your time to access and gain insights. A more straightforward process also means lower legal fees and compliance overhead.

In addition, Devron unlocks access to more datasets, including those previously out-of-reach, due to privacy, regulatory, or risk concerns. It can do this because it doesn't expose the underlying source information, only sharing updated weights and parameters with the global model. 

Of course, when you hear that, the first question you'll ask is: how can I build a model on data I cannot see? This is the fundamental tension of privacy-preserving machine learning—information must remain private, compliant, and secure at its location. But, at the same time, data scientists need a thorough understanding of the data to develop an effective model. 

Devron alleviates this tension by generating representative synthetic data and statistics for use during the pre-processing phases (before training). The image above shows an example of the synthetic data that Devron creates. The artificially created synthetic data reflects the source data while maintaining the privacy of the original data. You can leverage it the same way you would source data in the EDA phase, including making queries and visualizations. Additional statistics include the number of missing values and unique values for each column.

This is particularly useful for datasets containing restricted information, such as Personally Identifiable Information (PII), Protected Health Information (PHI), or proprietary data. It enables them to be used for analysis purposes, including in third-party sharing scenarios and data monetization applications. For example, Devron facilitates machine learning on data from different hospitals that would usually only be able to use their own data. Creating a model with insights from the data of multiple hospitals improves analysis and potentially saves lives.

Increasing the amount of data available for analysis can result in better models overall. In fact, research shows that as training data grows, predictive accuracy increases, and estimation variance decreases. 

2. ETL Without the E or the L

By leaving your data where it resides, you no longer need to extract or load it into another system before analysis, significantly reducing data engineering overhead, costs for various ETL/data transfer tools, and data lineage issues. Instead, you can skip straight to the transformation portion. 

Devron allows you to create data preparation pipelines for each dataset, enabling you to perform whatever data cleaning or feature engineering is necessary. You can flexibly use Python packages of your choosing for data preparation (e.g., pandas, NumPy, scikit-learn). In addition, you can test your pipelines on the synthetic data before running them on the satellite, which allows for more rapid iteration and experimentation.
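A per-dataset preparation pipeline of this kind might look like the sketch below. Nothing here is Devron-specific API; the `prepare` function and the synthetic sample are hypothetical, showing only the pattern of dry-running ordinary pandas cleaning and feature engineering on synthetic data before submitting it against the real, remote dataset:

```python
import math

import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning + feature-engineering pipeline."""
    out = df.drop_duplicates().copy()
    # Impute missing amounts with the column median
    out["amount"] = out["amount"].fillna(out["amount"].median())
    # Simple engineered feature
    out["log_amount"] = out["amount"].map(math.log1p)
    return out

# Dry-run the pipeline on a synthetic sample before running it
# against the real (remote) dataset
synthetic = pd.DataFrame({"amount": [10.0, None, 10.0, 40.0]})
prepared = prepare(synthetic)
```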

After preparing the data, you'll configure the dataset relationships just as in the standard data science workflow. However, the data is not 'joined' or aggregated together in one place. Instead, you identify the foreign keys between datasets, and Devron uses those relationships during training. For example, in the image below, you can configure the relationships between four datasets: credit card transactions, demographics, payments, and retail banking transactions. You need only specify the foreign key between the datasets to create a 'cohort' (group of datasets) for model training. Additionally, you do not need to relate each dataset to all the other datasets—they can be linked, as shown below.

3. Localized Model Training & Faster Time to Insight

With Devron, feature selection, algorithm selection, and hyper-parameter tuning are identical to the well-established data science process. Model training is also similar, except it occurs locally at each satellite. The satellites then send the trained model weights and artifacts back to the control center. 

Devron never transfers raw data from the satellites to the control center. Instead, weights and artifacts from each model trained locally at each satellite are aggregated to create the 'global model'—a composite model that leverages the insights from all the datasets.
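To make the aggregation step concrete, here is a generic federated-averaging (FedAvg-style) sketch with made-up numbers; it illustrates the general technique of combining locally trained weights, not Devron's actual aggregation code:

```python
import numpy as np

# Locally trained parameter vectors from two satellites (invented values)
local_weights = [
    np.array([0.2, 1.0, -0.5]),  # satellite A
    np.array([0.4, 0.8, -0.3]),  # satellite B
]
sample_counts = [100, 300]  # training examples seen at each satellite

# Weight each satellite's parameters by its share of the total data,
# so satellites with more examples contribute more to the global model
total = sum(sample_counts)
global_weights = sum(
    w * (n / total) for w, n in zip(local_weights, sample_counts)
)
```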

After model training (just as in standard data science), you then evaluate the model's performance. With Devron, you can assess the global model's performance metrics on each satellite's hold-out dataset—as opposed to the standard approach, where the model is generally evaluated on a single test set. Additionally, retraining a model may improve its performance on some satellites while degrading it on others. Being able to see your global model's performance at each satellite helps ensure the model generalizes to more data and will continue to make good predictions once it is served.

If the model's performance is not to your liking, you can adjust the data preparation pipeline, feature selection, model selection, or hyper-parameters. Once the model performance is satisfactory, you can incorporate it into the desired applications just like any other model.

Devron's federated machine learning platform enables you to train models without navigating the complexities of gaining access and moving data. Furthermore, Devron manages all the intricacies of federated learning on the backend, so you can build superior models on distributed data with a similar (yet simplified) process to what you use today. As a result, Devron empowers you to unlock new data and improve model performance and generalizability. 

To see firsthand how federated learning with Devron reduces ETL overhead, improves model performance, and enables faster insight, book a demo today.