Is it bad data or bad design causing AI projects to fall short of expectations?

Derek Cowan, Director of System Engineering APAC at Cohesity, discusses how the right data can have a transformational impact for Artificial Intelligence and Machine Learning projects.

The growing adoption of Artificial Intelligence (AI) and Machine Learning (ML) technologies is becoming a key driver for organizations looking to become leaders in their fields and earn the label of being ‘best in class’. Sadly, too many AI and ML projects fail to reach their full potential.

There can be numerous reasons for this, including poor goal setting, budgetary constraints, and function creep during the planning, proof of concept and realization phases.

Bad data is also a key reason for project failure. It comes down to the oft-repeated truism, ‘Garbage In, Garbage Out’, which is as relevant today as it has ever been.

Two of the four key roles and responsibilities for AI projects outlined by analyst firm Gartner relate to data. Alongside AI Architect and ML Engineer, Gartner lists Data Scientist (responsible for identifying use cases, determining which data sets and algorithms are required and building AI models), and Data Engineer (responsible for making the appropriate data available, with a focus on data integration, modelling, optimization, quality and self-service).
Organizations ignore these data-related roles and the wider importance of data at their peril. They risk joining the 50% of IT leaders that Gartner has identified as likely to struggle to move their AI projects beyond proof of concept before the end of 2023.

Work backwards from your intended outcome

No AI or ML project has a chance of being successful if there is not an accompanying data strategy. A key aspect of getting this in place is to work backwards from your desired outcome to figure out what ingredients it requires.
For this, the project goals should be articulated with crystal clarity. It is not enough to say ‘AI will help us streamline our production plant’. That needs to be broken down into greater detail, with a focus on specific process elements.
Only then is it possible to see what data is required to achieve each goal, and this can’t be generalized. It is important to list all the data elements required. These might not all exist before the project begins, so working out what they are and how they will be collected is vital.
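
As a rough illustration of working backwards, the sketch below maps each specific goal to the data elements it depends on and reports which of them are not yet being collected. The goal names, data element names and the `find_data_gaps` helper are all hypothetical, chosen only to make the idea concrete; a real inventory would be far larger and would likely live in a data catalog rather than in code.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Goal:
    """One specific project goal and the data elements it depends on."""
    name: str
    required_elements: set[str] = field(default_factory=set)


def find_data_gaps(goals: list[Goal], available: set[str]) -> dict[str, set[str]]:
    """Return, per goal, the required data elements that are not yet collected."""
    gaps = {}
    for goal in goals:
        missing = goal.required_elements - available
        if missing:
            gaps[goal.name] = missing
    return gaps


# Hypothetical goals for a production plant; every name here is illustrative.
goals = [
    Goal("Reduce unplanned downtime on line 3",
         {"machine_sensor_readings", "maintenance_logs"}),
    Goal("Cut scrap rate in final inspection",
         {"inspection_results", "raw_material_batch_ids"}),
]

# Data elements the organization is assumed to collect already.
available = {"maintenance_logs", "inspection_results"}

gaps = find_data_gaps(goals, available)
for goal_name, missing in gaps.items():
    print(f"{goal_name}: still need {sorted(missing)}")
```

Each goal with a non-empty gap points to data that must be sourced, or to a collection process that must be set up, before the project can proceed with confidence.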

This is often much trickier than it sounds, and external data scientists and data engineers, with experience of working in AI/ML development, will bring an ability to ask the right questions, look round corners at problems, keep a lid on function creep and make sure the most difficult questions are addressed rather than parked.

Bringing them in early can mean an organization doesn’t find itself having to do this work at a later stage when it can add expense and time to a project or worse – contribute to its failure.

By the end of the ‘working backwards’ process, an organization should know what it needs to progress with confidence, having worked through what it needs to know, what it already knows, and what it must do to obtain the missing know-how.

Establish an effective data policy

Organizations can’t assume that, because they collect data already, they can just pass it over to the AI/ML developer and it will drop neatly into new applications, out the other end of which will come amazingly informative dashboards of new information. If only it were that simple.

The way data is collected has changed in recent years, in part due to the European Union’s General Data Protection Regulation (GDPR) and the extensive updates to the Australian Privacy Act.

Therefore, this year’s data set isn’t compatible or comparable with that from four years ago from a GDPR perspective, and compliance with the Australian Privacy Act may require more from your organization than what has previously been in place.

Further, with the draft of the Online Privacy Bill set to be tabled in Federal Parliament in the coming months, data privacy requirements are only set to increase. Perhaps significant amounts of historical data – even recent historical data – are missing.

Perhaps the organization needs to put in place entirely new data collection policies to start from a designated ‘Day One’.

Working out a ‘Day One’ data policy is one thing. However, to hit the ground running with an AI/ML project as soon as it kicks off, some historical data will be useful. This requires several decisions to be made, including which data sets are most important, how far back to go, whether to work on a subset for proof of concept and bring in more data sets later, and whether some poorly managed data can be cleaned enough for use – or not.

Making the right decisions will be vitally important, and this is another area where the trusted third-party view of an AI/ML specialist will be extremely valuable.

Dismantle data silos

It is possible an entirely new data policy will be needed going forward to keep the AI/ML system fed with the right quality data. Not only might new data need to be gathered, but new working practices might also be needed. This means there could be significant implications across the whole organization. For example, it might be important to do a ‘once-and-for-all’ purge of data silos that can often hold data-related projects back.

Recent research has found that many IT teams are spending 40% of their time managing and maintaining data infrastructure, and only 32% of data available to enterprises is put to work, while the remaining 68% goes unleveraged.
In too many organizations there are still different lines of business capturing the same data for their own use. This is cost-inefficient, causes data silos which lead to data fragmentation and governance issues, and inevitably means there is variance in the accuracy and quality of data.

Which data set should the AI/ML project use? That’s the wrong question to ask. The right question to ask is ‘How do we ensure there is just one set of this data, shared across all lines of business?’

Ask the question, find the answer, implement it, and repeat for all silos. This will help the current AI/ML project immensely, should create cost-efficiencies, and should support future AI/ML and other projects going forward. It will also be beneficial for other data management processes such as backup, restore and archive.
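
To make the idea of ‘just one set of this data’ concrete, here is a minimal sketch that merges records held separately by two lines of business into a single shared set. The silo names, field names and the `consolidate` helper are hypothetical, and the rule that the most recently updated record wins is an assumption made for illustration. In practice, deciding which copy is authoritative is a governance question, not a coding one.

```python
from __future__ import annotations

from datetime import date


def consolidate(silos: dict[str, list[dict]], key: str = "customer_id") -> dict[str, dict]:
    """Merge records from several silos into one record per key.

    Assumed rule: when silos disagree, the most recently updated record wins.
    """
    merged: dict[str, dict] = {}
    for silo_name, records in silos.items():
        for record in records:
            current = merged.get(record[key])
            if current is None or record["updated"] > current["updated"]:
                # Keep track of which silo supplied the surviving record.
                merged[record[key]] = {**record, "source": silo_name}
    return merged


# Two hypothetical lines of business capturing the same customer data.
silos = {
    "sales": [
        {"customer_id": "C1", "email": "old@example.com", "updated": date(2021, 3, 1)},
        {"customer_id": "C2", "email": "b@example.com", "updated": date(2022, 1, 15)},
    ],
    "support": [
        {"customer_id": "C1", "email": "new@example.com", "updated": date(2022, 6, 30)},
    ],
}

single_set = consolidate(silos)
for customer_id, record in sorted(single_set.items()):
    print(customer_id, record["email"], "from", record["source"])
```

The same pattern would be repeated for each silo in turn, with the merge rule agreed by the lines of business that share the data.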

Even in the era of Digital Transformation, Fortune 500 companies take weeks or, more commonly, months to deliver clean data to their teams, and doing so typically demands a carefully co-ordinated effort across multiple teams. This has forced them to fall back on ingenious, albeit often insufficient, workarounds such as synthetic data sets or subsets of data.

With Cohesity, there is an answer. Users can instantly provision clones of backup data, files, objects, or entire views and present those clones to support a variety of use cases. Cohesity’s zero-cost clones are extremely efficient and can be instantly created without having to move data.

This is in stark contrast to the inefficiency of the traditional DevTest paradigm, in which full copies of data are created between infrastructure silos, and it marks a dramatic shift towards modernization.

By decoupling data from the underlying infrastructure in this way, we enable organizations to automate data delivery and provide data mobility. Zero-cost clones can be spun up in minutes rather than weeks. As a result, customers have reduced their service level agreements (SLAs) for data delivery, accelerated application delivery and migration, and greatly simplified their data preparation.

With the right data flowing in, AI and ML projects can provide dashboards of insights that can be used by the organization in the transformational ways it envisions. Focusing on the data can help an organization land on the right side of Gartner’s 50%, but this focus must be there from the outset.
