Understanding unstructured data in the context of AI

Tuesday, December 3, 2024, 10:00 AM, from InfoWorld

The volume of data being created today is truly staggering. IDC projects that global data will reach 400 zettabytes (400 billion terabytes) by 2028, with 90% of that data classified as unstructured data. The proliferation of so much data, and so much unstructured data, raises two primary questions: How do we manage it all, and how can we use it to power the next generation of AI applications?

Here, we’ll dive into what unstructured data is, how today’s leading organizations leverage it to power business, and what we can expect as the amount of data continues to grow exponentially.

A primer on different data types

The traditional form of data that most people are familiar with is structured data, which fits snugly into table-based formats. For years, structured data has been the foundation of traditional database systems and data management due to its inherently organized form of data storage and retrieval.

A step beyond structured data is semi-structured data, which emerged in response to the rigidity of table-based formats. Semi-structured data retains some organizational elements of structured data but removes the traditional tabular constraints. This type of data drove the growth and popularity of NoSQL databases such as Cassandra, MongoDB, and Redis, which were designed to manage more flexible data structures.

This brings us to unstructured data, which has overwhelmingly become the most common type of data. As its name signifies, unstructured data can come in any form or format, varies widely in size, and carries complex semantic relationships. As a result, unstructured data requires a very different approach to processing and management.
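
To make the three categories concrete, here is a minimal sketch in Python. The BookRow class, the field names, and the sample records are hypothetical illustrations, not taken from any particular system.

```python
import json
from dataclasses import dataclass

# Structured: every record has the same typed fields, like a table row.
@dataclass
class BookRow:
    title: str
    author: str
    year: int

structured = BookRow(title="It", author="Stephen King", year=1986)

# Semi-structured: self-describing keys, but records may differ in shape.
semi_structured = [
    json.loads('{"title": "It", "tags": ["horror"]}'),
    json.loads('{"title": "Misery", "reviews": [{"stars": 5}]}'),
]

# Unstructured: free-form content with no schema; meaning must be inferred.
unstructured = "Stayed up all night finishing this one. Clowns are ruined for me."

print(structured)
print([sorted(record) for record in semi_structured])
print(len(unstructured), "characters of free text")
```

The structured row can be validated and queried by field, the semi-structured records can still be navigated by key even though their shapes differ, and the unstructured review offers nothing to query until something interprets its meaning.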

Taking a deeper look at semantic complexity, consider three different photos of the same object. Even though the raw data behind each of these photos could vary widely (file size, number of pixels, resolution, and so on), their semantic meaning is the same. Therein lies the challenge with modern data management. What’s the best way to store, search, and analyze content based not on its technical characteristics but on its meaning?
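
One common answer, sketched here as an assumption rather than something the article prescribes, is to represent each item as a numeric vector (an embedding) and compare vectors instead of file properties. The file names, sizes, and three-number embeddings below are made up for illustration; a real system would obtain embeddings from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical photos with hand-written embeddings (not model output).
photos = {
    "cat_phone.jpg": {"bytes": 2_400_000, "pixels": (4032, 3024), "embedding": [0.91, 0.10, 0.05]},
    "cat_thumb.png": {"bytes": 48_000, "pixels": (320, 240), "embedding": [0.88, 0.14, 0.07]},
    "truck_scan.jpg": {"bytes": 2_400_000, "pixels": (4032, 3024), "embedding": [0.06, 0.12, 0.95]},
}

def similarity(name_a, name_b):
    return cosine_similarity(photos[name_a]["embedding"], photos[name_b]["embedding"])

# Same subject, very different file size and resolution.
same_subject = similarity("cat_phone.jpg", "cat_thumb.png")
# Identical file size and resolution, different subject.
different_subject = similarity("cat_phone.jpg", "truck_scan.jpg")

print(f"same subject: {same_subject:.2f}, different subject: {different_subject:.2f}")
```

In this toy data, the two cat photos score close to 1.0 despite sharing no technical characteristics, while the truck photo scores low even though its size and resolution match the first cat photo exactly.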

The many categories of unstructured data

Unstructured data comes in two main flavors: human-generated and machine-generated. Examples of human-generated unstructured data include:

Text messages: Most of us write text messages using informal language, such as abbreviations (GTFO!) and emojis.

Emails: While typically more formal than text messages, emails often mix semi-structured data, such as to and from fields, with free-form text, images, and attachments.

Social media posts: Content on social media platforms varies widely in structure and content depending on the medium used (Facebook versus LinkedIn versus Twitter/X, for example).

Handwritten notes: Notes written by hand are one of the original forms of unstructured data and can include text, diagrams, drawings, or other visual elements.

Audio recordings: Unstructured audio takes many forms, including voice mails, phone calls, audio notes, and other types of audio files.

Transcripts: Interviews, meetings, phone calls, and speeches can all be transcribed, each with various levels of accuracy.

Images: Visual data can include photographs, diagrams, charts, illustrations, and screenshots, each potentially containing multiple layers of information from facial expressions to text overlays to complex scenes.

Videos: Video content combines visual, audio, and often textual elements (like captions or overlays) into complex unstructured data that can range from short mobile clips to professional productions to surveillance footage.

Unstructured data can be machine-generated as well, with examples including:

IoT data: As the number of smart, internet-connected devices grows, so does the amount of data they create and collect.

Sensor data: Similarly, the number and kinds of sensors that collect data continue to grow, including motion sensors, GPS sensors, temperature sensors, and many more.

Machine log data: This type of data could include system logs, application logs, and event logs.

Natural language processing (NLP) data: Speech recognition, language translation, and sentiment analysis technologies all produce unstructured data.

Web and app data: Web and mobile apps produce a wide variety of unstructured data, including performance data, user data, and error logs.

The above lists certainly aren’t exhaustive, so it’s easy to see how and why unstructured data has come to dominate our universe.

What unstructured data means for data management

The differences between structured and unstructured data mean that traditional database systems and modern AI database systems handle information in different ways. Consider a task like organizing books in a library. For traditional databases, which hold structured data, a search would involve looking up a specific book in a catalog that labels everything clearly: the title, author name, publication date, and so on. If you wanted to find all of the books written by Stephen King, all you would need to do is search the author catalog and find the exact matches.

This is a basic representation of how traditional databases work, a precise and predictable way to find exact matches.
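
The catalog lookup can be sketched with Python’s built-in sqlite3 module. The books table, its columns, and the books_by_author helper are hypothetical names chosen for this illustration.

```python
import sqlite3

# In-memory database standing in for the library catalog.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("The Shining", "Stephen King", 1977),
        ("Misery", "Stephen King", 1987),
        ("Dracula", "Bram Stoker", 1897),
    ],
)

def books_by_author(author):
    """Return titles whose author column matches the given string exactly."""
    rows = conn.execute(
        "SELECT title FROM books WHERE author = ? ORDER BY title", (author,)
    )
    return [title for (title,) in rows]

print(books_by_author("Stephen King"))  # ['Misery', 'The Shining']
print(books_by_author("Steven King"))   # [] because a misspelling matches nothing
```

The second query shows the other side of precision: an exact-match lookup returns nothing for a near miss, because the database compares values rather than meaning.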

Meanwhile, when it comes to modern AI databases, which hold unstructured data, a search would involve finding books that are similar to your favorite. Instead of considering the author or title, you would need to take other factors like writing style and content into account. This type of search is more subjective and relies more on “feel” than on black-or-white exact matches.

This is how modern AI databases work with the types of unstructured data mentioned above. Instead of looking for exact matches, these databases look for results that are similar or “close enough.”
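
A “close enough” search can be sketched as a nearest-neighbor lookup over vectors. The three-number scores below (horror, suspense, humor) are hand-assigned for illustration and are not real embeddings. Euclidean distance is used here as one reasonable choice; cosine similarity is another common one.

```python
import heapq
import math

# Hypothetical vectors: (horror, suspense, humor), hand-assigned per book.
library = {
    "The Shining": (0.9, 0.8, 0.1),
    "Dracula": (0.8, 0.6, 0.0),
    "Gone Girl": (0.2, 0.9, 0.2),
    "Good Omens": (0.2, 0.3, 0.9),
}

def most_similar(query_title, k=2):
    """Return the k titles whose vectors lie closest to the query book's vector."""
    query = library[query_title]
    candidates = (title for title in library if title != query_title)
    return heapq.nsmallest(
        k, candidates, key=lambda title: math.dist(query, library[title])
    )

print(most_similar("The Shining"))  # ['Dracula', 'Gone Girl']
```

Note that the function always returns its k closest candidates, ranked by distance. Nothing in the result says whether a match is “right,” only that it is nearer than the alternatives.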

The key difference in this example is that traditional databases use the old-school library catalog to find exact matches. If you search for “Stephen King,” you’ll get a list of only Stephen King’s books. On the flip side, AI databases are more like asking someone to recommend books that are similar to those written by Stephen King, whether it’s the tone, the writing style, the subject matter, or some other characteristic. In this scenario, you may receive strong suggestions, but they may not be “perfect” matches.

The approach with modern AI databases is a balancing act. If you spend more time searching, and consider more properties of the books (tone, style, subject, etc.), you’ll get more accurate results — but the process will be slower.
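
One way this balancing act shows up in practice, offered here as an assumption rather than the article’s own method, is in approximate search over a partitioned index: scanning more partitions costs more distance computations but recovers more of the true nearest neighbors. The sketch below uses random vectors, random partition centers, and an illustrative n_probe parameter.

```python
import math
import random

random.seed(7)  # fixed seed so the sketch is repeatable
DIM, N_VECTORS, N_PARTITIONS = 8, 2000, 20

vectors = [[random.random() for _ in range(DIM)] for _ in range(N_VECTORS)]

# Hypothetical coarse index: use random vectors as partition centers and
# assign every vector to its nearest center.
centers = random.sample(vectors, N_PARTITIONS)
partitions = [[] for _ in centers]
for idx, vec in enumerate(vectors):
    nearest = min(range(N_PARTITIONS), key=lambda c: math.dist(vec, centers[c]))
    partitions[nearest].append(idx)

def search(query, k, n_probe):
    """Scan only the n_probe partitions whose centers are closest to the query.

    Returns the top-k vector indexes found and the number of vectors scanned.
    """
    order = sorted(range(N_PARTITIONS), key=lambda c: math.dist(query, centers[c]))
    candidates = [i for c in order[:n_probe] for i in partitions[c]]
    top = sorted(candidates, key=lambda i: math.dist(query, vectors[i]))[:k]
    return top, len(candidates)

query = [random.random() for _ in range(DIM)]
exact, _ = search(query, k=10, n_probe=N_PARTITIONS)  # scanning everything is exact

for n_probe in (1, 5, N_PARTITIONS):
    found, scanned = search(query, k=10, n_probe=n_probe)
    recall = len(set(found) & set(exact)) / len(exact)
    print(f"n_probe={n_probe:2d} scanned={scanned:4d} recall={recall:.2f}")
```

Raising n_probe can only add candidates, so recall never drops as the search does more work. The cost is that the number of vectors scanned, and therefore the query time, grows with it.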

What it all means going forward

The continued explosion of unstructured data makes managing it an increasingly crucial challenge for all types of organizations to overcome. Unstructured data will outpace structured data at a staggering rate, and the organizations that can best understand and interact with it will give themselves a leg up on the competition. Navigating this new paradigm requires businesses to implement and fully utilize tools that allow them to extract value from their data assets.

The key word here is seamlessness. Those who seamlessly manage both structured and unstructured data set themselves up to bridge the gap between raw data and insights that meaningfully drive the business forward. The era of 400 zettabytes is coming. Ultimately, distinguishing between unstructured and structured data will take a back seat to organizations’ ability to effectively derive value from all of their data.

James Luan is the VP of Engineering at Zilliz and creator of the open source vector database Milvus. James has a master’s degree in computer engineering from Cornell University and extensive experience as a database engineer at Oracle, Hedvig, and Alibaba Cloud. He played a crucial role in developing HBase, Alibaba Cloud’s open-source database, and Lindorm, a self-developed NoSQL database. He is a respected member of the technical advisory committee of the LF AI & Data Foundation, contributing his expertise to shaping the future of AI and data technologies.

—

Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.