Most current discussions of data-driven AI centre on structured and unstructured data, a distinction that has become more pronounced since Large Language Models (LLMs) such as GPT-4 entered the scene. Structured data is data with a predefined structure, such as databases and spreadsheets. These are highly organized information systems with a clear format; think of being able to find what you need in a neatly arranged closet. Excel tables and CSV files are typical examples: their consistent columns and data types make searching for specific values easy. Unstructured data, on the other hand, lacks a predefined schema and does not follow any particular format. It can take almost any form: text, images or videos. Surprisingly, unstructured data, chaotic as it appears, is the most common form in which information is conveyed!
LLMs are inherently more comfortable with unstructured text than with structured data, which presents them with a particular set of obstacles. Structured data has to be delivered in a form the model can interpret, and for LLMs this usually means a conversion step that can reduce both the efficiency and the accuracy of their work. Such data often contains numbers that carry no meaning without context, alongside specific fields whose semantics must be understood explicitly. LLMs struggle here because they rely on recognizing patterns rather than following definite rules. In addition, working with structured data involves answering complex queries and relating different elements to one another, as is common in SQL systems, and this does not come naturally to LLMs.
APIs play a central role here as the go-between that facilitates the exchange of information between different software systems. Through APIs, structured data can be transformed on the fly into forms that LLMs can digest: reports, summaries of key points, or even chat interactions.
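As a minimal sketch of this idea, the following Python example takes a structured JSON payload, as an API might return it, and turns it into a short summary an LLM could consume as plain text. The payload, its field names and the function summarize_orders are illustrative assumptions, not part of any real API.

```python
import json

# Hypothetical API response; the field names and values are illustrative only.
API_RESPONSE = """
{
  "region": "EMEA",
  "quarter": "Q2",
  "orders": [
    {"product": "Sensor A", "units": 120, "revenue": 2400.0},
    {"product": "Sensor B", "units": 80, "revenue": 3200.0}
  ]
}
"""


def summarize_orders(payload: dict) -> str:
    """Turn a structured order payload into a one-paragraph text summary."""
    orders = payload["orders"]
    total_units = sum(order["units"] for order in orders)
    total_revenue = sum(order["revenue"] for order in orders)
    # The product with the highest revenue is called out explicitly.
    top = max(orders, key=lambda order: order["revenue"])
    return (
        f"In {payload['quarter']}, the {payload['region']} region sold "
        f"{total_units} units across {len(orders)} products for a total "
        f"revenue of {total_revenue:.2f}. The top product by revenue was "
        f"{top['product']} ({top['revenue']:.2f})."
    )


summary = summarize_orders(json.loads(API_RESPONSE))
print(summary)
```

In a real system the payload would come from an HTTP call rather than a constant, and the summary would be passed to the LLM as part of its context.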
Data Lakehouses are a possible solution to these challenges: they combine the storage capabilities of a Data Lake with the analytical capabilities of a Data Warehouse. They remove the need to manage different types of information separately in order to provide comprehensive datasets to LLMs, and they can turn structured information into the forms LLMs need, which makes the models more productive. Furthermore, many existing Data Lakehouses come with AI search functionality to improve searchability and analysis. Scalable processing and advanced analytics tools allow huge volumes of data to be handled efficiently while making the most of both LLM performance and the available data.
Another creative approach uses generative AI to turn structured data into unstructured data, that is, to generate text from structured records. For example, an LLM can be trained to create detailed descriptions from tabular data, which can then convey the context more clearly.
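The table-to-text idea can be illustrated without a trained model. The sketch below uses a fixed sentence template in place of an LLM to turn each row of a small CSV table into a description; a generative model would produce more varied wording, but the input and output have the same shape. The table contents, the column names and the helper functions are assumptions made for illustration.

```python
import csv
import io

# Hypothetical product table; the columns and values are illustrative only.
CSV_DATA = """product,category,price_eur,stock
Desk Lamp,Lighting,29.90,14
Office Chair,Furniture,149.00,0
"""


def row_to_description(row: dict) -> str:
    """Render one table row as a sentence using a fixed template."""
    stock = int(row["stock"])
    if stock > 0:
        availability = f"and {stock} units are in stock"
    else:
        availability = "and it is currently out of stock"
    return (
        f"{row['product']} is a {row['category'].lower()} item priced at "
        f"{float(row['price_eur']):.2f} EUR, {availability}."
    )


def table_to_text(csv_text: str) -> list[str]:
    """Convert every row of a CSV table into a natural-language description."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row_to_description(row) for row in reader]


descriptions = table_to_text(CSV_DATA)
for line in descriptions:
    print(line)
```

The template makes the meaning of each column explicit in the sentence itself, which is the context a bare number in a table cell lacks.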
LLMs are at their most valuable when dealing with unstructured data. Their natural language processing capabilities enable them to make sense of such data directly, without any predefined structure. They extract and understand the context in which the information is presented, which allows them to take even subtle details into account and consequently produce more precise and relevant responses.
Possible solutions and example
To make structured data usable in a chatbot system built on retrieval-augmented generation (RAG) and LLMs, an extra architectural component is needed. This component should incorporate data conversion and integration modules that use APIs to retrieve structured data from databases and then transform it into natural-language narratives or summaries.
One example use case is a financial chatbot that accesses a customer’s transaction history through an API linked to a well-structured database. The data conversion module converts these records into a conversational format that the LLM can readily use when generating its responses. This combination enables the chatbot to provide correct and contextually appropriate answers, because it draws on two sources: the structured data and the generative capabilities of the LLM.
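The financial chatbot use case above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference architecture: an in-memory SQLite database stands in for the structured backend, the table layout and demo transactions are invented, and the functions fetch_transactions, to_narrative and build_prompt are hypothetical names. The call to the LLM itself is left out; the example stops at the prompt that would be sent.

```python
import sqlite3


def load_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with invented demo transactions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions "
        "(customer_id INTEGER, booked_on TEXT, merchant TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?, ?)",
        [
            (1, "2024-05-02", "Grocery Store", -54.20),
            (1, "2024-05-03", "Employer Inc", 2500.00),
            (1, "2024-05-05", "Streaming Service", -12.99),
            (2, "2024-05-04", "Book Shop", -20.00),
        ],
    )
    return conn


def fetch_transactions(conn: sqlite3.Connection, customer_id: int) -> list:
    """Retrieval step: read one customer's transactions in date order."""
    cursor = conn.execute(
        "SELECT booked_on, merchant, amount FROM transactions "
        "WHERE customer_id = ? ORDER BY booked_on",
        (customer_id,),
    )
    return cursor.fetchall()


def to_narrative(rows: list) -> str:
    """Conversion step: turn structured rows into plain sentences."""
    sentences = []
    for booked_on, merchant, amount in rows:
        verb, prep = ("received", "from") if amount > 0 else ("spent", "at")
        sentences.append(
            f"On {booked_on}, the customer {verb} "
            f"{abs(amount):.2f} EUR {prep} {merchant}."
        )
    return " ".join(sentences)


def build_prompt(question: str, narrative: str) -> str:
    """Integration step: combine the narrative with the user's question."""
    return (
        "Answer the customer's question using only the transaction "
        "history below.\n\n"
        f"Transaction history: {narrative}\n\n"
        f"Question: {question}"
    )


conn = load_demo_db()
narrative = to_narrative(fetch_transactions(conn, customer_id=1))
prompt = build_prompt("How much did I spend on streaming?", narrative)
print(prompt)
```

Keeping retrieval, conversion and prompt assembly as separate functions mirrors the modular component described above, so that the database or the wording of the narrative can change independently.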
Conclusion
In summary, although structured data presents challenges for LLMs, appropriate preparation and awareness of context can still yield valuable insights. Conversely, unstructured data is well suited to LLMs’ strengths, allowing for flexible and contextually rich analysis. Gated information is one thing; a path into the wild is quite another. Effectively managing both data types is important for realizing the full potential of generative AI.
The author: Maximilian Kuhn – AI Engineer Consultant HICO-Group