Krishna Subramanian, Co-founder and COO, Komprise, takes our questions on the intersection of AI and data management.
What stood out to you this year in the 2024 State of Unstructured Data Management report?
This year we are seeing tension in enterprise IT between the need to be cost-efficient with data storage and data management and the need to prepare the business for AI. Nobody wants to be left behind, but our research shows that 70% of organizations will not be able to increase their IT budgets to support AI projects. Paying for AI will therefore need to be carved out of existing budgets – such as cloud computing – and IT leaders will need to optimize costs amidst continued data growth to free up resources for AI projects. It will be interesting to see which strategies for AI execution prevail, as there is already a lot of choice on the market.
Did your survey uncover any trends in how companies will implement AI given these fiscal challenges?
There are many pathways to AI, and our survey shows that enterprises are considering a mix-and-match approach of keeping things in-house versus using ready-made AI services. For instance, 37% are custom-training existing models while 32% are building their own LLMs from scratch. I’m guessing some organizations will do both. Only 32% are planning to leverage cloud-based AI, which seems low given the AI innovation that we’re seeing among the major cloud providers right now.
One priority seems to be creating AI-ready data infrastructure, the top strategy noted by 44%. What exactly does this entail?
Organizations realize that while there may be a few specialists who train AI models, for the most part, their users will want to leverage pre-trained AI with corporate data. AI inferencing and use is a much larger, broader market than AI training. Hence, IT organizations are increasingly investing in the data infrastructure needed to find, curate, audit and feed corporate data to AI while maintaining data governance. AI-ready data infrastructure includes the compute to train or run AI models, such as GPUs and AI accelerators, as well as data management frameworks to automate AI data workflows and protect sensitive data. Some organizations are looking to run many of these functions in the cloud while others will deploy on-premises or across hybrid infrastructure. In all these scenarios, data is typically scattered across many silos and needs to be searched, extracted and fed to the AI model.
Interestingly, only 13% restrict the data and AI services employees can use. What do you make of that?
AI is still in its infancy especially when it comes to inferencing and use. Most organizations don’t yet have corporate policies or guidelines for employee use of AI and are taking a wait-and-see approach before developing institutional guardrails. This will evolve rapidly as users start leveraging AI for corporate applications.
What are enterprises struggling with most when it comes to managing unstructured data?
This year we asked participants to rank technical challenges separately from business challenges. As in 2023, moving data without disruption to users and applications is the top technical challenge. What commonly happens when migrating or archiving data to new storage is that users can’t find their data later: it’s no longer in the same file directory location so they must put in a support ticket for IT. Also, applications can break if they can’t find data that has been moved. These issues are caused by the proprietary methods (such as stubs) that some storage and data management technologies use to move data.
Another top technical challenge is using AI to classify and segment data. Data classification, which involves identifying and labeling unstructured data, helps people search across petabyte-scale, distributed environments for the files they need or identify security risks from improperly managed sensitive data. AI-enhanced data classification is a new, highly efficient method, but best practices are still emerging here.

Prepping for AI is the top business challenge for unstructured data management. There is a lot to figure out in this area: from enriching metadata to add structure to the data so it is more usable, to efficiently moving data to AI tools, to the whole gamut of security and privacy issues that arise from using GenAI tools in the workplace.
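As a rough illustration of the metadata enrichment and classification described above, the sketch below tags files with a coarse category and an age bucket so they become searchable across a large estate. The categories, bucket thresholds and tag names are hypothetical examples, not any vendor's actual schema.

```python
# Toy metadata-enrichment pass: derive searchable tags for a file from
# its extension and last-modified time. All categories and thresholds
# here are illustrative assumptions.
import os
import time

# Hypothetical mapping from file extension to a coarse data category.
CATEGORIES = {
    ".csv": "tabular", ".parquet": "tabular",
    ".jpg": "image", ".png": "image",
    ".docx": "document", ".pdf": "document",
}

def tag_file(path, mtime, now=None):
    """Return searchable tags (category + age bucket) for one file."""
    now = time.time() if now is None else now
    age_days = (now - mtime) / 86400
    ext = os.path.splitext(path)[1].lower()
    return {
        "category": CATEGORIES.get(ext, "other"),
        # Coarse buckets: "hot" under 90 days, "warm" under a year, else "cold".
        "age": "hot" if age_days < 90 else "warm" if age_days < 365 else "cold",
    }
```

In practice an AI classifier would look at content rather than just extensions, but even simple tags like these let users filter petabytes of files by type and age instead of browsing directories.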
How do data security and governance factor into this year’s survey?
It’s a central part of nearly every data strategy today – from delivering reliable, fast data access and leveraging data in AI to meeting security and compliance requirements and keeping costs in check. AI data governance/security is the top future capability (47%) for unstructured data management, a marked increase from 2023, when it came in as the 6th-highest priority with only 28% of participants. Nearly 60% say they need more staff with skills related to security, compliance and sensitive data. On the GenAI front, data leakage into tools is the topmost concern – and for good reason. Organizations can’t risk PII, trade secrets and customer data being exposed in public LLMs for the world to see. Given the increasingly widespread use of these tools in everyday work tasks, it’s a troubling situation. IT needs tools and policies to monitor data usage, segment sensitive data so that it cannot be ingested into generative AI, and track outcomes for any inaccurate or nefarious results. These are early days for AI, and the solutions aren’t quite there yet.
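To make the idea of segmenting sensitive data away from GenAI concrete, here is a minimal sketch of a pre-ingestion gate: it redacts obvious PII patterns before text reaches a GenAI tool, and blocks documents where too much would need redacting. Real deployments use purpose-built classifiers; the two regex patterns and the threshold here are simplifying assumptions.

```python
# Toy pre-ingestion gate for GenAI: redact obvious PII, or refuse
# PII-heavy documents outright. Patterns and threshold are illustrative.
import re

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # US SSN format
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),   # email address
]

def gate_for_genai(text, max_redactions=5):
    """Redact PII; return None (block ingestion) if the doc is PII-heavy."""
    redacted, hits = text, 0
    for pat in PII_PATTERNS:
        redacted, n = pat.subn("[REDACTED]", redacted)
        hits += n
    return None if hits > max_redactions else redacted
```

The point is architectural rather than the specific patterns: sensitive data gets intercepted and segmented before it can leak into a public LLM, not after.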
Cost optimization was the top data storage priority this year. What are some proven tactics to reduce costs in data storage?
We hear this from our customers all the time: data is growing too fast, they cannot afford to buy more storage as quickly as needed, and they lack visibility into their data across hybrid IT environments to make decisions that avoid waste. Overspending on data storage happens because IT organizations store most or all of their data the same way. New, active data belongs on the highest-performing, most expensive storage, but that is typically 20% or less of the total. The rest can be migrated or archived to lower-cost storage tiers as it ages or as its value to the organization decreases. Another problem is the preponderance of duplicate data, orphaned data and zombie data, which should be deleted altogether. This requires the ability to search across massive, distributed data estates and delete these data types. And of course, data that is not stored compliantly can incur regulatory fines – another preventable cost. A security breach can cost millions of dollars in cleanup and legal fees, which is why security needs to be baked into data storage and data management practices. Green data center technologies and green data management practices – delete what you don’t need and archive cold data to the cloud – also reduce energy usage and bills.
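The two tactics above – tiering aged data and finding duplicates – can be sketched in a few lines. This is a toy single-directory scan, not an enterprise data management product: the 365-day cold threshold, reliance on access time, and whole-file hashing are all simplifying assumptions.

```python
# Toy scan for cost optimization: flag files untouched for N days as
# candidates for a cheaper tier, and byte-identical duplicates as
# candidates for deletion. Thresholds are illustrative assumptions.
import hashlib
import os
import time

def scan(root, cold_days=365):
    cold, seen, dupes = [], {}, []
    cutoff = time.time() - cold_days * 86400
    for dirpath, _, names in os.walk(root):
        for name in sorted(names):
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            if st.st_atime < cutoff:        # untouched for cold_days
                cold.append(path)
            with open(path, "rb") as f:     # hash content to spot duplicates
                digest = hashlib.sha256(f.read()).hexdigest()
            if digest in seen:
                dupes.append((path, seen[digest]))  # (copy, original)
            else:
                seen[digest] = path
    return cold, dupes
```

At petabyte scale this exact approach would not fly (hashing every file is expensive, and access times are not always reliable), but it shows the underlying analytics: age-based tiering candidates and content-identical deletion candidates both fall out of a metadata-plus-content scan.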
IT leaders had slightly different priorities for data storage. They care most about data access for the workforce followed closely by ease of data mobility. What are the challenges here?
IT executives are responsible for ensuring that the business gets what it needs to be productive and make intelligent decisions. Therefore, it makes sense that they are primarily concerned with performance and reliability for IT services – both data and applications. People should be able to find their data and search among directories quickly in a self-service manner and without disruption, per question #5. As for ease of data mobility, IT leaders know that they need to leverage different classes of storage on-premises and in the cloud to get the best price/performance for different workloads. Data can’t all be treated the same, and security and compliance (such as GDPR) make this even more relevant. There’s also the looming requirement to leverage unstructured data for AI. IT departments need tools and technologies to move large data sets quickly as needed, into the storage or AI platform of choice without risk of data loss or breach.