GenAI's hidden challenge: Mastering unstructured data

Carl D’Halluin, CTO, Datadobi, says it’s time for companies to start focusing on the management of data rather than the devices on which data is stored.

In a recent McKinsey Global Survey, 65% of business respondents said they regularly use GenAI, nearly doubling the figure from a survey ten months earlier.

These organizations are already seeing direct impacts on their business from using GenAI. Costs are decreasing and revenue is jumping in the areas where they are fully engaged with the technology.

Expectations are also high – three-quarters of respondents think GenAI will lead to significant or disruptive change in their industries.

It’s easy to see why we’ve reached those figures already. GenAI offers enormous benefits in automating routine processes and streamlining operational efficiencies. It paves the way for smarter, data-driven decisions and unlocks new ways to engage with customers and improve the services they receive.

GenAI is not automagic

But GenAI is not automagic – it can’t operate in a vacuum and is only as good as the data it’s given.

GenAI must be trained on high-quality, well-curated data to deliver actionable value. And that means weeding out the garbage in billions of files.

The problem arises from the simple fact that most data is unstructured. It includes text, videos, images, social media posts, and a lot more.

Unstructured data makes up 80-90% of all data generated, which, in real terms, is 132 zettabytes of data created in 2023 alone, 64% of which came from enterprises. Mind-boggling, to say the least.

So, the cold, hard truth is that GenAI will not perform without some organization of your unstructured data.

Do you really know what is in your unstructured data?

Bart Willemsen, VP analyst at Gartner, sums up the ‘garbage in, garbage out’ scenario perfectly when he says: “I don’t care how good the AI technology itself is, if you have crappy data, you will have crappy AI… Most companies don’t actually know the data they’ve accumulated – in some cases decades and decades of history.”

GenAI models can struggle to interpret unstructured data correctly, because it can arrive in many formats or contain irrelevant information. The result can be unreliable outputs that render the GenAI model ineffective.

Unlocking the value of unstructured data is pivotal in maximizing the impact of GenAI. And that comes down to these key success factors:

Get 360-degree optics

That means an enterprise-wide view, so decisions can be made about what data has potential value. But what do we mean by this?

You want your (potentially) billions of files visualized, identified and organized. Sadly, time is not on our side as GenAI capabilities outpace our ability to use them properly. We need to be able to communicate with data scientists busy in the backroom, speeding up the identification of the correct data to train GenAI models that work for your business.

Given the complex issues of ethics and data protection, governance is crucial, as is knowing whether particular data can be used for GenAI at all. A clear, macro view of your data gives you an accurate picture, and it also reassures you that you are doing right by your business, its customers and its employees.

Conversely, a micro view gives you granular detail on specific applications and projects, along with the age, ownership and other characteristics of the data that data scientists also need to focus on.
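To make the macro and micro views concrete, here is a minimal sketch in Python of a file inventory that summarizes a directory tree by file type and age. It is an illustration only, not how any particular data management product works: the function name `inventory`, the age buckets and the one-year and five-year thresholds are all assumptions chosen for the example, and a real estate of billions of files would need a far more scalable approach.

```python
import os
import time
from collections import Counter
from pathlib import Path


def inventory(root, now=None):
    """Walk `root` and summarize files by extension and age bucket.

    The bucket boundaries (1 year, 5 years) are illustrative assumptions.
    """
    now = time.time() if now is None else now
    by_extension = Counter()
    by_age = Counter()
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.stat()
            except OSError:
                continue  # skip files that vanish or cannot be read
            # Age is measured from the last modification time, in days.
            age_days = (now - info.st_mtime) / 86400
            if age_days < 365:
                bucket = "under 1 year"
            elif age_days < 5 * 365:
                bucket = "1 to 5 years"
            else:
                bucket = "over 5 years"
            by_extension[path.suffix.lower() or "(none)"] += 1
            by_age[bucket] += 1
            total_bytes += info.st_size
    return {
        "by_extension": by_extension,
        "by_age": by_age,
        "total_bytes": total_bytes,
    }
```

Calling `inventory("/path/to/share")` returns counts per file extension and per age bucket plus the total size, which is the kind of summary that lets a team decide where to look more closely. Ownership could be added in the same loop on platforms that expose it, but it is left out here to keep the sketch portable.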

Minimizing the effort

Billions of data files equate to an inordinate amount of effort to sift out the correct data to train and/or augment large language models (LLMs). LLMs rely on massive datasets to learn patterns, so choosing the correct data is critical for producing accurate, useful, and unbiased results.

Often overlooked is how data stored in the data lake is turned into quality datasets. If you put garbage in, you will get garbage out. We therefore can’t overlook this critical first step of data preparation and management.
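As one small illustration of that preparation step, the Python sketch below filters a directory tree down to candidate files for a dataset by dropping unwanted file types, empty files and exact duplicates. The function name `select_candidates` and the `ALLOWED_SUFFIXES` set are hypothetical, and exact-duplicate removal by SHA-256 hash is only one simple notion of "garbage"; real curation would also weigh relevance, sensitivity and quality.

```python
import hashlib
from pathlib import Path

# Hypothetical allow-list: which file types count as usable text here.
ALLOWED_SUFFIXES = {".txt", ".md"}


def select_candidates(root, allowed=ALLOWED_SUFFIXES):
    """Return files under `root` that pass three basic checks:
    an allowed suffix, non-empty content, and content not seen before.
    """
    seen_digests = set()
    kept = []
    # Sorting makes the choice of which duplicate survives deterministic.
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in allowed:
            continue
        # Reads the whole file into memory; fine for a sketch,
        # but large files would call for chunked hashing.
        data = path.read_bytes()
        if not data.strip():
            continue  # empty or whitespace-only file
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_digests:
            continue  # exact duplicate of a file already kept
        seen_digests.add(digest)
        kept.append(path)
    return kept
```

Even a filter this basic shows why the step matters: every file it removes is one that would otherwise have been copied, processed and possibly learned from.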

Much like managing traffic on a busy highway, we must keep data flowing smoothly through analysis and preparation before it is used to train GenAI models. If we get the data where it needs to be first, we can achieve the ROI promised by LLMs.

Modern data management is a requisite for GenAI

Historically, there have been no tools to help manage unstructured data, since it has simply been far easier to add storage. Accumulation rates have jumped sharply in recent years, and growth will continue due to AI and machine-generated data. That means it’s prime time to adopt modern data management technology that will help you stay ahead of the curve. It’s time to start focusing on the management of data rather than the devices on which data is stored.

Data management technology can provide deep insights and actionable intelligence on unstructured data, giving organizations the power to train GenAI models effectively with high-quality data. Handling data properly reduces risk and improves decision-making, making it easier to unlock the value of unstructured data for GenAI initiatives.
