How financial institutions can build predictive risk models with Synthetic Data Generation

Ahmet Cenk, Senior Business Solution Manager, SAS

While most people are familiar with large language models like ChatGPT that generate text, Generative AI can also generate synthetic data. Synthetic Data Generation refers to on-demand, self-service or automated data generated by algorithms or rules rather than collected from the real world, explains Ahmet Cenk at SAS.

Every business decision comes with risk. Yet, the reality is that without risk, there is no reward. What matters most for organisations is to find trustworthy ways to navigate uncertainty, manage risk with confidence, mitigate threats, and capitalise on unforeseen opportunities. It all comes down to risk management and intelligent risk analytics.

Having the capacity to make fast and accurate risk decisions can dramatically change business outcomes. This is true for virtually any organisation, and especially for entities that extend credit: from traditional banks to fintechs, car dealers, mortgage companies, communications providers, government agencies, healthcare systems, insurance companies, credit management services and even retailers.

Given the unpredictable global financial environment, regulators are urging banks to identify risk exposures in an effort to strengthen the resilience of the financial system. The Central Bank of the UAE (CBUAE) recognises risk management, along with internal audit and compliance, as key control functions in a bank.

The Saudi Central Bank (SAMA) has also released a risk-management framework for banks practising Islamic banking. In addition, financial institutions must tackle the risks associated with global warming, since the impact of climate change could endanger assets worth trillions of dollars globally.

Amidst these conditions, organisations increasingly use risk models to feed automated decisioning processes, while also striving to quickly deploy new, innovative models to meet changing business needs. In recent years, the integration of Artificial Intelligence (AI) and Machine Learning (ML) into risk models has resulted in significant accuracy and efficiency improvements. That is because AI and ML systems excel at recognising patterns in data to make predictions.

So, what about data? Building and training these models on faulty base data can lead to adverse consequences, as decisions end up resting on incorrect assumptions and estimations. Consequently, a strong data foundation with easy and fast access to large volumes of diverse, high-quality data is of paramount importance when developing risk assessment models.

That is certainly not an easy feat. Acquiring that kind of real-world data in the first place usually translates into high data acquisition and annotation costs, not to mention the effort required to analyse and profile the data. But even when businesses have plenty of real-world data available, additional challenges tend to arise: the quality or historical depth of the data may not always meet expectations.

What is more, although the financial sector generates an ocean of data daily, organisations are required to handle sensitive personally identifiable information safely, within the permitted regulatory compliance parameters, or risk substantial fines as well as incidents that damage their reputation. Data anonymisation, for its part, has proven inadequate from time to time.

There are also rare scenarios to consider, where organisations do not have enough data points for modelling due to a scarcity of events. A newly established digital bank, or a consumer-heavy bank looking to build up a strong corporate book, might lack data for the specific portfolios and events that are key inputs to a risk model.

How, then, can organisations extrapolate and model the unknown without overfitting? That is exactly where synthetic data comes into play. The emergence of ChatGPT in late 2022 opened new avenues of AI innovation and sparked a Generative AI evolution across industries.

According to a recent global survey conducted by SAS and Coleman Parkes, 2% of data decision makers in the UAE and Saudi Arabia say their companies have fully integrated Generative AI into regular processes, while 48% are running initial tests ahead of implementation, exceeding the global benchmark of 43%. Another 34% intend to use Generative AI within one to two years.

In the case of Synthetic Data Generation, Generative AI goes beyond prediction and conversation; it generates new data as its primary output. And while most people are familiar with large language models like ChatGPT that generate text, Generative AI can also generate synthetic data. Synthetic Data Generation refers to on-demand, self-service or automated data generated by algorithms or rules rather than collected from the real world.

Rather than simply generating data at random, Synthetic Data Generation aims to multiply real-world data while ensuring its correlations, distributions and patterns are neither underestimated nor overfitted.
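As a simple illustration of that fidelity check, the short sketch below compares the correlation matrices of a real table and its synthetic counterpart. This is a minimal example, not a prescribed method: the loan-table columns and the drift threshold in the comments are hypothetical.

```python
import numpy as np
import pandas as pd

def correlation_gap(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Largest absolute difference between the two correlation matrices.

    A small gap suggests the synthesizer preserved pairwise relationships;
    a large gap flags that they were underestimated or distorted.
    """
    real_corr = real.corr(numeric_only=True)
    synth_corr = synthetic.corr(numeric_only=True)
    return float(np.max(np.abs(real_corr.values - synth_corr.values)))

# Hypothetical usage with two loan tables sharing the same columns
# (e.g. income, loan_amount, utilisation, default_flag):
# gap = correlation_gap(real_loans, synthetic_loans)
# print(f"Worst correlation drift: {gap:.3f}")  # e.g. investigate if > 0.1
```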

A Generative Adversarial Network (GAN) is the most popular technique for Synthetic Data Generation, mimicking specific distributions. Training involves two neural networks: one network, the generator, produces the data, while the other, the discriminator, tries to judge whether that data is real or fake. If the data is deemed fake, the generator is notified and tries to improve on the next batch it generates. The two networks are therefore training against each other, hence the adversarial part.

A GAN’s training process can be effort-heavy and typically requires graphics processing units (GPUs), but it can capture highly nonlinear, complex relationships among variables and, ultimately, produce very accurate and realistic synthetic data. It reproduces the same statistical properties, probabilities, patterns and characteristics as the real-world data on which it is trained, while it can also generate data at or beyond the boundaries of the original data, potentially representing novel cases that would otherwise be neglected.
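For intuition, here is a minimal sketch of that adversarial loop, written in PyTorch purely for illustration (the article does not prescribe a framework). It trains a toy generator and discriminator on a single Gaussian variable standing in for one standardised financial feature; production tabular synthesizers extend the same pattern to mixed categorical and numeric columns.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "real" data: one standardised financial variable,
# drawn from a Gaussian with mean 2.0 and std 0.5.
def real_batch(n=128):
    return torch.randn(n, 1) * 0.5 + 2.0

# Generator: maps 8-dimensional noise to one synthetic value.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
# Discriminator: outputs the probability that a sample is real.
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

bce = nn.BCELoss()
g_opt = torch.optim.Adam(G.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(D.parameters(), lr=1e-3)

for step in range(2000):
    # 1) Train the discriminator to tell real from generated samples.
    real = real_batch()
    fake = G(torch.randn(128, 8)).detach()  # freeze G for this step
    d_loss = bce(D(real), torch.ones(128, 1)) + bce(D(fake), torch.zeros(128, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Train the generator to fool the discriminator.
    fake = G(torch.randn(128, 8))
    g_loss = bce(D(fake), torch.ones(128, 1))  # reward fakes judged "real"
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

# After training, the generator mimics the real distribution.
with torch.no_grad():
    synthetic = G(torch.randn(1000, 8))
print(synthetic.mean().item(), synthetic.std().item())  # roughly 2.0 and 0.5
```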

By using Generative Adversarial Networks for Synthetic Data Generation, together with reliable risk management solutions, financial institutions can gain a significant competitive advantage.

For example, they could synthetically generate default cases when real-world examples are scarce. Data quality is another challenge that Synthetic Data Generation can mitigate, since the available data is not always satisfactory: it provides more cases and can iron out issues present in the original data, supporting high data quality, which in turn leads to better performance in mission-critical areas, including improved credit scoring accuracy and reduced decisioning bias.
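As one hedged illustration of augmenting scarce default cases, the sketch below uses the open-source SDV library, one of several tabular synthesizers (the article does not name a specific tool), to fit a GAN-based model and conditionally sample extra defaults. The file name and the 'default' column are hypothetical.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer
from sdv.sampling import Condition

# Hypothetical loan book with very few observed defaults.
real_loans = pd.read_csv("loans.csv")  # includes a binary 'default' column

# Describe the table so the synthesizer knows each column's type.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_loans)

# Fit a GAN-based tabular synthesizer on the real data.
synthesizer = CTGANSynthesizer(metadata, epochs=300)
synthesizer.fit(real_loans)

# Conditionally sample extra defaulted loans to rebalance the training set.
defaults = Condition(column_values={"default": 1}, num_rows=5000)
synthetic_defaults = synthesizer.sample_from_conditions(conditions=[defaults])

augmented = pd.concat([real_loans, synthetic_defaults], ignore_index=True)
```

In practice, the augmented table would feed the credit-scoring model's training step, with the fidelity and privacy of the synthetic rows validated first.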

The use of synthetic data can further help organisations secure the privacy of their real data. After all, there are many reasons why a financial institution would rather not risk using real customer data in the process: the involvement of third parties in model development, the use of cloud services and regulatory concerns, to name a few. Synthetic Data Generation can produce entire datasets that share the same characteristics as the real data, without compromising the privacy of customers.

Some Synthetic Data Generation techniques can be used to create data for simulations or hypothetical scenarios. Banks can simulate once-in-a-generation Black Swan events from sparse data sets, train models on new exogenous developments such as climate change, and enhance micro-economic, macro-economic and market-condition simulations to identify potential risks before they become challenges.

In the field of fraud and financial crime detection and prevention, financial institutions can use synthetic data to extrapolate from rare events and anomalies, training models on specific fraud and anti-money laundering typologies. Synthetic data can also be useful in penetration testing of existing fraud control systems in order to fine-tune their defensive capabilities.

Recognising the value of synthetic data early on, Digital Dubai launched a framework for its use in October 2022. Gartner predicts that by 2026, 75% of businesses will use Generative AI to create synthetic customer data, up from less than 5% in 2023. Moreover, the SAS and Coleman Parkes survey reveals that 24% of UAE and Saudi Arabia organisations are already addressing data challenges via Synthetic Data Generation, while another 26% are actively considering following the same path.

Among the UAE and Saudi Arabia organisations that have adopted or are experimenting with Generative AI integration into analytics processes, 80% have bolstered their risk management and compliance efforts, achieving higher efficiency, more accurate customer risk measurement that yields a better-performing portfolio and increased profitability, and better adherence to regulatory standards.

Ultimately, the goal should be to strike the right balance between risk aversion and business development. Organisations that can make informed decisions based on trustworthy, high-quality data will be able to successfully mitigate risks, maintain regulatory compliance and reduce the cost of doing business. The right partners can already help businesses in the financial sector and beyond transform their risk modelling and decision-making processes from time-consuming to automated and streamlined, paving the way for next-generation risk management.
