Credit card fraud poses a significant challenge to financial institutions, resulting in direct monetary losses, reimbursement of customer funds, and the cost of investigating and resolving fraud cases. For example, the Federal Trade Commission received 2.4 million fraud reports from consumers in 2022, representing over $8.8 billion in losses.
Not surprisingly, the ability to accurately predict unauthorized transactions before they clear is a top priority for financial institutions. Preventing credit card fraud today involves leveraging advanced analytics, machine learning algorithms, and historical transaction data to identify patterns and potential indicators of illegal activity. Successfully predicting and mitigating the effects of fraud has many benefits to both consumers and institutions, including enhancing customer protection, reducing financial losses, and safeguarding the firm’s reputation.
In this article, we will dive deeper into this real-life use case. We’ll discuss the challenges in building a predictive fraud model across multiple datasets, including across numerous institutions, and how federated learning can help address them—resulting in a superior model in a shorter time.
The Challenge: Distributed & Private Data
Suppose a bank's corporate analytics team wants to improve its ability to predict unauthorized charges. The bank may look to use many disparate sources of data spread across systems and jurisdictions and/or owned by various divisions (e.g., customer demographic information, non-credit card transaction data, regional transaction data, etc.).
Alternatively, the bank may focus primarily on historical customer credit card transactions. By working with other banks, each can amass more transaction data, leading to more accurate detection. However, the data needed to train the predictive model is scattered across three participating banks, none of which would willingly share its customer transaction data with the others. If each bank could instead protect the privacy of its data and retain control of it within its own environment, each could contribute its data and benefit from the others' with confidence.
Traditionally, to tap these data silos, the bank would need to build lengthy ETL pipelines that duplicate and move the data from each system and jurisdiction into a centralized store. This process typically takes months and requires costly data movement tools as well as additional infrastructure. In addition, because the datasets contain sensitive PII, the bank must comply with various privacy regulations before the analytics team can access the data to build its model.
The Solution: Devron Federated Data Science Platform
Employing Devron’s federated data science and machine learning platform, the banks can safely collaborate on their data while retaining control over privacy and keeping the data within their respective environments.
Devron’s state-of-the-art privacy-preserving federated learning architecture allows data scientists to develop and train models remotely using a notebook interface. From the Control Center (the orchestrator to explore, develop and run experiments), the data scientist can execute Python code on Devron Satellites (the proximate compute at each data source). Then, as each Satellite completes individual model training, it sends model weights and artifacts back to the Control Center.
The key benefit of federated learning is the ability to build a global model that achieves superior performance, better generalizability, and richer insights by training on more data, without the effort of centralizing it.
Building a Superior Fraud Prediction Model using Devron
For this specific use case, we’ll build a horizontal federated learning model using the Devron platform. Horizontal federated learning is used when datasets have the same schema, i.e., share the same features but have different rows. In this scenario, each bank collects the same core customer data per transaction, such as name, address, income, and average balance; however, each bank has its own individual set of customers.
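As a toy illustration of horizontal partitioning (with made-up columns and customers, not the article's dataset), each bank holds different rows of the same schema:

```python
import pandas as pd

# Hypothetical mini-shards: identical schema, disjoint customers.
bank_a = pd.DataFrame({"customer": ["alice", "bob"],
                       "amount":   [12.50, 830.00],
                       "is_fraud": [0, 1]})
bank_b = pd.DataFrame({"customer": ["carol", "dan"],
                       "amount":   [44.10, 9.99],
                       "is_fraud": [0, 0]})

# Conceptually, the "global" training set is the row-wise union.
# Federated learning achieves the effect of this concat without
# ever moving rows off each bank's premises.
assert list(bank_a.columns) == list(bank_b.columns)
combined = pd.concat([bank_a, bank_b], ignore_index=True)
```

Vertical federated learning, by contrast, would apply if the banks held different columns about the same customers.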
Devron’s proprietary APIs allow data scientists to easily link together horizontally split datasets and seamlessly map relationships between various features without moving or exposing the underlying data.
To demonstrate this capability, we used a Kaggle simulated dataset containing over one million rows and twenty-three columns. It includes 1,289,000 legitimate transactions and 7,506 fraudulent transactions from 1,000 customers and 800 merchants.
We split this dataset into three horizontal shards, one per regional bank (referenced as A, B, and C), for training. A fourth shard was held out as a test set for model evaluation.
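A sketch of this kind of sharding, assuming a single pandas DataFrame as input (the article does not specify its exact split proportions, so the test fraction here is an illustrative choice):

```python
import pandas as pd

def make_shards(df: pd.DataFrame, n_train: int = 3, test_frac: float = 0.2,
                seed: int = 42):
    """Shuffle and split one dataset into n_train horizontal training
    shards plus a held-out test shard (illustrative proportions)."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(len(shuffled) * test_frac)
    test_set = shuffled.iloc[:n_test]          # held out for evaluation
    rest = shuffled.iloc[n_test:]
    shard_size = len(rest) // n_train
    shards = [rest.iloc[i * shard_size:(i + 1) * shard_size]
              for i in range(n_train)]          # one shard per "bank"
    return shards, test_set
```

In the federated setting, each training shard would live at its own Satellite rather than in one process as shown here.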
Exploring the distributed datasets
The first step is understanding the datasets at each Satellite. Devron provides three APIs for effective exploratory data analysis—or, as we call them, the three S’s: Schema, Statistics, and a sampling of Synthetic data. These three APIs enable a data scientist to learn the data shape, gain context on data formats, and understand the distributions of the features, respectively.
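Devron's API surface is not shown here, so the snippet below only mimics what each of the three S's returns, using pandas on a stand-in DataFrame (this is an illustration of the concepts, not the Devron SDK):

```python
import numpy as np
import pandas as pd

# Stand-in transaction data for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({"amount": rng.exponential(70, 500),
                   "category": rng.choice(["grocery", "travel", "gas"], 500)})

# 1. Schema: column names and types -- the data's "shape".
schema = df.dtypes

# 2. Statistics: per-column summaries and distributions.
stats = df.describe(include="all")

# 3. Synthetic sample: stand-in rows that mirror each column's
#    marginal distribution without exposing any real record.
synthetic = pd.DataFrame({
    "amount": rng.exponential(df["amount"].mean(), 10),
    "category": rng.choice(df["category"].unique(), 10),
})
```

On the platform, these summaries are computed remotely at each Satellite; only the aggregated schema, statistics, and synthetic rows travel back to the data scientist.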
As you can see below, bank A has 29 columns and 472,488 rows, and the schema includes features such as transaction date, customer name, credit card number, and profession.
Devron allows you to pull this information for each regional bank, as well as generate synthetic data to get a feel for the data without actually seeing the sensitive source information. The synthetic data is a visual aid for building a data preprocessing pipeline and is never used for model training; only the real data on each Satellite is.
Comparing the synthetic data across the banks, we see below that Bank B’s PII columns for social security number and date of birth have been hidden by its data owner.
Now that we have a better understanding of the data, we can obtain column-level statistics to drill down even further and get an overview of each dataset, as well as variables, interactions, correlations, and missing values.
Pre-process the data
With an understanding of the raw data, we can prepare our data. This can include any data cleaning or feature engineering required by the dataset or use case.
In this case, we built a fraud preprocessing function (shown below) that constructs features based on transaction attributes (amount, category, location, time), demographics of the purchaser (age, gender), and distance of merchant to the purchaser (via latitude and longitude coordinates). This preprocessing pipeline is then applied to the data hosted on each Satellite.
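The article does not show the function body, but a preprocessing pipeline along these lines is a plausible sketch: haversine distance from purchaser to merchant, age from date of birth, and time-of-day features. All column names (`trans_date`, `dob`, `lat`, `merch_lat`, `amt`, etc.) are assumptions, not the actual dataset's schema:

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Feature engineering applied identically at every Satellite.
    Column names are hypothetical stand-ins for the real schema."""
    out = df.copy()
    ts = pd.to_datetime(out["trans_date"])
    out["hour"] = ts.dt.hour                        # time-of-day signal
    out["age"] = (ts - pd.to_datetime(out["dob"])).dt.days // 365
    out["merchant_dist_km"] = haversine_km(         # purchaser -> merchant
        out["lat"], out["long"], out["merch_lat"], out["merch_long"])
    out["amt_log"] = np.log1p(out["amt"])           # tame the heavy tail
    return out
```

Because the same function runs at every Satellite, each bank's shard is transformed into an identical feature space before training begins.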
Horizontally train the model
After creating a preparation pipeline, the training can begin.
For our algorithm, we chose Extreme Gradient Boosting (XGBoost), a tree-based ensemble well suited to classifying fraud in imbalanced data. The main advantage of XGBoost is its ability to train on both categorical and numerical data simultaneously with very little data preprocessing and hyperparameter tuning.
We start by initializing an XGBoost model with appropriate hyperparameters, such as max depth and the number of boosting rounds. We then use Devron’s horizontal federated training API to send the initialized model from the Control Center to train against each of the three regional bank Satellites.
The model is trained at each Satellite, which sends updated, encrypted weights and parameters back to the Control Center for aggregation. If desired, we can incorporate existing models and adapt them for federated training. Upon completion, the API returns a global model that combines the learnings of the models from all three regional bank Satellites.
Results: Greater Model Accuracy & Performance
With a global model trained on all three remote datasets, it's important to compare how it performed relative to individually trained models to assess its benefits.
Figure 1 is a graph of the balanced accuracy of all four models. As shown below, the model for Bank A has a balanced accuracy of 65% and an F1-Score of 26%. In comparison, the global model improved balanced accuracy by 18 percentage points and F1-Score by 42.2 percentage points.
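Balanced accuracy averages recall over both classes, which makes it a fairer yardstick than raw accuracy when only a fraction of a percent of transactions are fraudulent. A from-scratch version of both metrics (equivalent to scikit-learn's `balanced_accuracy_score` and `f1_score`) makes the definitions concrete:

```python
def confusion_counts(y_true, y_pred):
    """Tally true/false positives and negatives for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def balanced_accuracy(y_true, y_pred):
    """Mean of true-positive rate and true-negative rate."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp / (tp + fn) + tn / (tn + fp)) / 2

def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the fraud class."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

A naive model that predicts "legitimate" for every transaction would score near 99% raw accuracy on this data but only 50% balanced accuracy and an F1 of zero, which is why these two metrics are the right ones to compare.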
The global model outperforms the individual bank models because it can harness the data and learnings from all three banks. This improves its ability to detect fraud across the board, regardless of the regional bank associated with the transaction.
Compared to the individual models, the global model detected 5,151 more fraud cases than the Bank A-only model, 118 more than Bank B, and 4,325 more than Bank C when tested against the test data (Table 1).
Calculating the Financial Impact of the Global Fraud Model
Based on data from Consumer Sentinel, a bank saves $650 on average for each fraudulent transaction detected. Doing the math, the global model translates into roughly $3.34M in savings for Bank A, $76K for Bank B, and $2.8M for Bank C when applied to the test data, for over $6.2M in total net savings from detecting fraudulent transactions across all three regional banks (Table 2).
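The savings arithmetic is simply the additional detections from Table 1 multiplied by the $650 average saving per detected fraud:

```python
SAVING_PER_DETECTION = 650  # Consumer Sentinel average, per the article

# Additional fraud cases the global model caught vs. each bank's
# individually trained model (Table 1).
extra_detections = {"Bank A": 5151, "Bank B": 118, "Bank C": 4325}

savings = {bank: n * SAVING_PER_DETECTION
           for bank, n in extra_detections.items()}
total_savings = sum(savings.values())
# Bank A: $3,348,150   Bank B: $76,700   Bank C: $2,811,250
# Total: $6,236,100 -- the "over $6.2M" figure in the text.
```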
In addition, it’s crucial to weigh the cost of model error before deploying into production. In the case of predicting fraud, incorrectly flagging a legitimate transaction as fraud and denying it is estimated to be roughly 18 times more costly ($11,662) than missing a genuine fraud case ($650).
From this perspective, avoiding false positives (legitimate transactions wrongly denied) matters even more than avoiding false negatives (genuine fraud cases missed).
With this dollar quantification, we can compare the total misclassification costs of all models. Table 3 shows that the three regional banks could reduce the missed opportunity cost of false positives by $39.6M by using the global model instead of their individually trained models.
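These trade-offs can be folded into a single expected-cost function for choosing between models. The unit costs below come from the article; the false-positive and false-negative counts are hypothetical, purely to show how the comparison works:

```python
FN_COST = 650      # missed genuine fraud (false negative)
FP_COST = 11_662   # legitimate transaction wrongly denied (~18x FN_COST)

def model_error_cost(false_negatives: int, false_positives: int) -> int:
    """Total dollar cost of a model's mistakes on an evaluation set."""
    return false_negatives * FN_COST + false_positives * FP_COST

# Hypothetical counts: a model that misses a few more frauds but raises
# far fewer false alarms comes out much cheaper overall.
chatty_model = model_error_cost(false_negatives=100, false_positives=500)
quiet_model = model_error_cost(false_negatives=150, false_positives=100)
```

Under this cost structure, one avoided false positive pays for roughly eighteen missed fraud cases, which is what drives the $39.6M figure.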
Overall, the financial benefit to the regional banks of selecting the global model over the individual models as the backbone of their machine-learning-based fraud detection is over $45M: $6.2M in net savings from detecting more fraud plus $39.6M in avoided false-positive costs.
Build Better Models by Unlocking More Data for Advanced Analytics & AI
If you’d like to harness distributed (and private) data for advanced analytics without the immense hassle of moving and unifying it in one place, Devron can help. That data could be distributed across divisions or jurisdictions within the same organization, or across companies, as this multi-bank fraud use case reflects.
Devron’s federated data science platform enables teams to radically speed up the AI development cycle and preserve privacy by bringing the analytics to the data instead of the data to the analytics. As a result, teams can unlock access to more data—allowing them to build more accurate and generalizable AI models.
In this fraud detection use case, Devron enabled the analytics team at each bank to develop and train a more accurate global model by securely combining the learnings from all three banks’ data. As a result, the banks were able to predict fraud more accurately and reduce false positives, realizing potential savings of over $45M for the organizations. And this was all achieved without the cost and time delays of building an ETL pipeline or risking leakage of sensitive information in data movement.
Like this real-life use case, enterprise analytics teams can leverage Devron to solve their most challenging distributed data problems. This superior AI platform enables data scientists to unlock data that was previously too distributed or private to use, realizing better models and hidden insights in the process.
Are you interested in learning more about Devron or this use case? Watch an on-demand demo of Devron in a notebook interface. In the recording, we walk through the step-by-step process of building a machine learning fraud prediction model across distributed and private datasets using Devron.