Studies have shown that acquiring a new customer can cost five to seven times more than retaining an existing one. That’s why customer churn is a top concern for most companies, particularly those with a subscription-based business model.
Losing customers can lead to decreased revenue, increased costs to acquire new customers, and lower customer loyalty. Additionally, customer turnover can damage a brand’s reputation, making it difficult to attract new customers.
The ability to predict churn offers several benefits. If businesses can identify customers likely to defect before they do, they can intervene with personalized offers, promotions, and proactive customer support to reduce attrition rates. In addition, predicting customer churn allows companies to allocate resources more efficiently, focusing on retaining high-value customers. As a result, businesses can improve customer retention rates, reduce customer acquisition costs, and increase revenue.
In this blog, we’ll dive deeper into the customer churn use case. We’ll highlight the challenges of customer churn prediction using machine learning on private customer data and how Devron’s platform can help solve them, allowing companies to manage customer retention more effectively.
The Challenge: Private Customer Data
In this scenario, we’ll assume the role of an analytics firm hired by a telecommunications company (the “Telco”) to predict customer turnover. To do our job, we’ll need access to the Telco’s customer data so we can predict churn based on past customer behavior.
This is where we, the analytics firm, could hit a snag in our process. It is also where even internal employees of the Telco have a hard time. The Telco's data contains sensitive, personally identifiable information about their customers, making it difficult to share with employees, let alone a third party. Traditionally, this would result in a lengthy data approvals process involving legal and compliance teams from both sides and delaying the project (and insights) for months on end.
Once we traverse those legal hurdles, the struggle isn’t over. Before the data can be shared with us, our client will likely need to undergo an extensive de-identification process to anonymize the private information. Although this will effectively protect consumer privacy, it could potentially eliminate essential signals in the data that are significant to the churn model we’ve been tasked with creating. Then, our client must identify a secure approach to actually transfer the data to us, which could be voluminous and result in cloud egress charges, other infrastructure costs, and risk.
Data sharing is not only a challenge for third-party service providers. We’ve increasingly seen this within large organizations themselves. Whether fueled by growing privacy regulations or increased data leakage events, there is a general reluctance to share data across departments, business units, or even with a centralized analytics team.
The Solution: Devron, a Federated Data Science Platform
Luckily, there is a solution that makes accessing private datasets across departments, organizations, and jurisdictions easy and secure. That solution is Devron, a federated data science and machine learning platform that enables data science teams to build and train AI models on distributed, private, and heterogeneous data sources where the data resides.
Devron’s platform consists of a Control Center and one or more Satellites. The Control Center is where data scientists confirm and explore connected datasets, develop models, and run experiments. Each Satellite is proximate to a data source and connects to a dataset in a read-only, privacy-preserving manner. As each Satellite completes individual model training, it sends model weights and artifacts back to the Control Center.
Raw source data never transfers from the Satellites to the Control Center—only model metadata. Additionally, the platform leverages privacy-enhancing technologies to protect the metadata against being reverse-engineered.
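Devron’s internal aggregation logic isn’t shown here, but the general idea behind combining per-Satellite model updates is well known from federated learning (e.g., the FedAvg algorithm). The following is a conceptual sketch only, not Devron’s API: each Satellite contributes its locally trained weights, and the Control Center combines them, weighted by how much data each Satellite trained on.

```python
# Conceptual sketch of federated aggregation (in the spirit of FedAvg).
# This is NOT Devron's API -- just an illustration of how per-Satellite
# model weights can be combined without the raw data ever moving.

def federated_average(satellite_weights, satellite_sizes):
    """Weighted average of model weights, one weight vector per Satellite.

    satellite_weights: list of weight vectors (list[float]), one per Satellite
    satellite_sizes:   number of training rows behind each weight vector
    """
    total = sum(satellite_sizes)
    dim = len(satellite_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(satellite_weights, satellite_sizes):
        for i, w in enumerate(weights):
            averaged[i] += w * (size / total)
    return averaged

# Two Satellites with local weights and local dataset sizes:
global_weights = federated_average(
    satellite_weights=[[0.2, 1.0], [0.6, 2.0]],
    satellite_sizes=[100, 300],
)
print(global_weights)  # approximately [0.5, 1.75]
```

The larger Satellite (300 rows) pulls the global weights toward its local solution, which is why the average is weighted by dataset size rather than computed uniformly.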
Ultimately, this means that with Devron, the data doesn’t need to move or be exposed to build a model and gain insights from it. As a result, the analytics firm can drastically accelerate the data approvals process, eliminate the need to move data, reduce legal and compliance fees, and avoid purchasing additional third-party tools to obfuscate data values, since the data remains in place and the raw values are never shared.
Devron In Action: Building A Customer Churn Model
Now, let’s get back to building our customer churn model.
In this scenario, we’re accessing a single private dataset that contains a multitude of information about the Telco’s customer base, including non-private data like current phone service and tenure, as well as personal information such as age, gender, and marital status.
Also included in this dataset is a column called “Churn,” representing whether or not each customer churned. A “Yes” value means that customer did not renew their service. This is our target variable: the value we’re trying to predict.
Exploring Private Datasets
As we’ve already said, we can’t see the raw customer data that the Telecommunications company owns. So you may be asking yourself, how are we supposed to explore the dataset if it is private?
To give data scientists a sense of the data during the Exploratory Data Analysis phase, Devron provides access to what we like to call the three S’s: Schema, Statistics, and Synthetic Data.
As you can see below, this dataset contains 21 columns and 7,043 rows, and the schema includes features such as customer ID, gender, tenure, service packages, payment method, and monthly charges.
In addition, Devron generates synthetic data based on the source connected to the satellite. This synthetic data allows us to explore and experiment before running our logic on the remote satellite data—allowing for more rapid experimentation.
When we call devron.synthesize, Devron generates a data frame of 1000 rows of synthetic data statistically representative of the actual data. This data will also be formatted like the real data, allowing us to develop and subsequently test our data processing pipeline before sending it to the Satellite.
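The exact signature of devron.synthesize isn’t shown here, but the underlying concept can be sketched in plain Python. The toy synthesizer below samples each column independently from the real column’s empirical values; this is a deliberate simplification, since real synthetic-data generators also aim to preserve cross-column correlations.

```python
import random

# Toy illustration of the synthetic-data concept -- NOT devron.synthesize.
# Each synthetic row draws every column value from the real column's
# observed values, so the synthetic frame matches the real schema and
# roughly matches per-column distributions.

def toy_synthesize(rows, n=1000, seed=42):
    """rows: list of dicts (the 'real' data). Returns n synthetic dicts."""
    rng = random.Random(seed)
    columns = {key: [row[key] for row in rows] for key in rows[0]}
    return [
        {key: rng.choice(values) for key, values in columns.items()}
        for _ in range(n)
    ]

real = [
    {"tenure": 1, "Churn": "Yes"},
    {"tenure": 34, "Churn": "No"},
    {"tenure": 45, "Churn": "No"},
]
synthetic = toy_synthesize(real, n=5)
print(len(synthetic))  # 5 rows, same schema as the real data
```

The key property this sketch demonstrates is that the synthetic output has the same shape and value domains as the real data, which is what lets us develop a processing pipeline against it before running anything on the Satellite.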
The synthetic data for this dataset looks like the following:
It is important to note that Devron never trains on synthetic data. It is meant purely as a visual aid to build and validate a data preprocessing pipeline. Only the real data from the Satellite is used for model training. Thus, valuable signals in the real data are not lost.
Preparing Data for Modeling
We can now build a data preparation pipeline using the information gathered during the EDA phase. This can include any data cleaning or feature engineering required by the dataset or use case. Devron allows you to specify the data preparation pipeline for each Satellite individually.
Below is an example of a pipeline for the Telecommunications company’s customer churn data. It includes:
- Removing columns
- Removing missing values
- Cleaning string values
- Ordinal encoding
- One-hot encoding
After testing the preprocessing jobs on the synthetic data, we can see that we have entirely numericized the synthetic data: every column is now a numeric type. To accomplish this, the pipeline above dropped the personally identifiable information of customer ID and gender and cleaned some of the strings in specific columns.
Another relevant step is feature engineering, where we transform the data further to accentuate essential features that can improve model performance. Here, that entails using ordinal encoding to convert Yes/No categorical columns to 0/1 numeric values, min-max scaling to put all continuous columns into the range (0, 1), and one-hot encoding for categorical columns with more than two categories.
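Those three encoding steps can be sketched with plain pandas on toy data (the column names MonthlyCharges and Contract are drawn from the standard Telco churn schema; this is an illustration, not Devron’s pipeline API):

```python
import pandas as pd

# Illustrative feature-engineering sketch on toy data.
df = pd.DataFrame({
    "Churn": ["Yes", "No", "No", "Yes"],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
})

# Ordinal encoding: Yes/No -> 1/0
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})

# Min-max scaling into [0, 1]
col = df["MonthlyCharges"]
df["MonthlyCharges"] = (col - col.min()) / (col.max() - col.min())

# One-hot encoding for columns with more than two categories
df = pd.get_dummies(df, columns=["Contract"])

print(df.dtypes)  # every column is now numeric (or boolean dummy)
```

After these transforms, the frame is fully numericized, matching the state described above.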
Training & Evaluating the Model
We’re finally ready to train a machine learning model on remote, private data.
For the use case of predicting customer churn, we’ll train a logistic regression model. The target column will be “Churn” since this is what we want to predict, and we’ll use a 75/25 train-test split: 25% of the data is held out for testing, and the model is trained only on the remaining 75%. See the training below:
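For readers who want to see the mechanics, here is the scikit-learn equivalent of that training step. On Devron the job runs on the Satellite next to the real data; here we generate random toy features purely to make the sketch self-contained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for the (private) preprocessed customer data.
rng = np.random.default_rng(0)
X = rng.random((200, 5))  # 200 customers, 5 numeric features
y = (X[:, 0] + 0.3 * rng.random(200) > 0.65).astype(int)  # toy "Churn" target

# 75/25 train-test split, matching the configuration described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

Logistic regression is a reasonable first choice for churn: it trains quickly, handles numeric features well, and its coefficients are interpretable, which helps when explaining why customers are predicted to leave.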
After the model training has been completed, we need to examine the model's performance. As shown below, we can find metrics about its performance on the 25% test split of the data:
The base heuristics show that the accuracy (the number of successful predictions divided by the total number of predictions) is 79%. However, the precision and recall of predicting churn are only 61% and 53%, respectively. To improve these scores, we could tune our hyperparameters, refine our data preparation pipeline, and/or add additional Satellites.
The detailed metrics show that the support (size) of the test data includes 1,300 non-churn customers (about 74%) and 457 churn customers (about 26%). So a naive baseline that predicts non-churn every time would already achieve about 74% accuracy.
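The baseline figure follows directly from the support counts quoted above:

```python
# Majority-class baseline from the test-split support counts.
non_churn, churn = 1300, 457
total = non_churn + churn

baseline_accuracy = non_churn / total  # always predict "non-churn"
print(f"{baseline_accuracy:.1%}")      # ~74.0%
```

Comparing our model’s 79% accuracy against this 74% baseline shows the model is learning something, but the modest precision and recall on the churn class indicate most of the remaining headroom lies there.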
At this stage—in a real scenario—we'd iterate on our model and try to make adjustments to improve accuracy. One reasonably easy step would be to balance the training and test datasets by ensuring an equal proportion of churn and non-churn entries. However, we will leave the process of improving the model to another blog post.
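One simple way to balance the classes, sketched below on toy data, is to upsample the minority (“churn”) class so both classes appear in equal proportion. Alternatives include downsampling the majority class or using a class-weighted loss; which works best is an empirical question for the iteration we’re deferring.

```python
import pandas as pd

# Toy class-balancing sketch: upsample the minority class with replacement.
df = pd.DataFrame({
    "tenure": [1, 34, 45, 2, 8, 22],
    "Churn":  [1, 0, 0, 1, 0, 0],
})

majority = df[df["Churn"] == 0]
minority = df[df["Churn"] == 1]

minority_upsampled = minority.sample(
    n=len(majority), replace=True, random_state=0
)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["Churn"].value_counts().to_dict())  # both classes now have 4 rows
```

Note that balancing should be applied only to the training data; the test split should keep the real-world class proportions so that evaluation metrics remain honest.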
Ultimately, a high-performing customer churn model can empower businesses to be more proactive, data-driven, and customer-centric in their retention strategies. This predictive capability not only enhances retention rates but also maximizes customer lifetime value by enabling companies to implement targeted engagement strategies and improve product offerings. In the end, a high-performing customer churn model will lead to greater customer loyalty, sustainable revenue growth, and better-optimized business processes.
Unlocking Private & Restricted Datasets for Advanced Analytics & AI
Whether you’re struggling to gain access to a sensitive dataset inside your own organization or to a private dataset owned by a third party, to build a churn model or otherwise, Devron can help.
Devron’s federated data science platform enables teams to leave their elusive or private dataset where it resides—compliant, secure, and in the data owner’s control—while at the same time unlocking the ability to perform advanced analytics on it. As a result, data science teams can accelerate their data science workflow and time to value, as well as build more accurate, generalizable, and reliable models.
Are you interested in diving deeper? Watch an on-demand demo of Devron in a Jupyter Notebook. In this recording, we walk through the step-by-step model-building process for a customer churn prediction using machine learning on a private dataset.