Deep learning AI models, such as GenAI chatbots, possess an insatiable appetite for data. These models need data for training purposes so they can be effective for real-world scenarios.
It can be challenging, in terms of effort, compliance, and cost, to provide AI models with this vast volume of data and to ensure quality, relevance, and diversity of data. What if we could feed AI models with synthetic data for training purposes?
That is exactly what IBM plans on doing. The tech giant wants to use synthetic data to feed AI’s massive appetite. It is seeking to patent a system for “synthetic data generation” that creates a simulation of authentic data from real users. It will deploy a method called Large-Scale Alignment for Chatbots (LAB), which will systematically generate synthetic data for the tasks that developers want their chatbot to accomplish.
The effectiveness of an AI model is heavily reliant on the data it is trained on. IBM realized that one of the bottlenecks for rapid AI development is the need for accurate and representative data for training models.
Training models can be pricey and time-consuming, and can often require dedicated resources. The LAB method can drastically lower costs and the time typically associated with training LLMs. It does this by continually assimilating new knowledge and capabilities into the model without overwriting what the model already learned. This can create an abundance of clean and processed data to train the AI models.
The new data generation method is based on a taxonomy, a classification of data into categories and subcategories. IBM’s taxonomy works by segregating instruction data into three overarching categories: knowledge, foundational skills, and compositional skills.
The taxonomy maps out existing skills and knowledge of the chatbot and highlights gaps that need to be filled. This system enables LLM developers to specify desired knowledge and skills for their chatbots.
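As a rough illustration of the idea, and not IBM's actual implementation, a taxonomy of this kind can be modeled as a nested dictionary whose three top-level branches are the categories named above. The subcategory names, the example counts, and the `iter_leaves` and `find_gaps` helpers are all hypothetical, chosen only to show how gaps in a chatbot's coverage might be surfaced.

```python
# Hypothetical taxonomy: the three top-level categories come from the article,
# but every subcategory and count below is made up for illustration.
# Each leaf value is the number of instruction examples already available.
TAXONOMY = {
    "knowledge": {
        "finance": {"accounting_basics": 12, "tax_law": 0},
    },
    "foundational_skills": {
        "math": {"arithmetic": 30},
        "coding": {"python": 0},
    },
    "compositional_skills": {
        "writing": {"email_summaries": 5},
    },
}


def iter_leaves(node, path=()):
    """Yield (path, example_count) for every leaf in the nested dict."""
    for name, child in node.items():
        if isinstance(child, dict):
            yield from iter_leaves(child, path + (name,))
        else:
            yield path + (name,), child


def find_gaps(taxonomy, min_examples=1):
    """Return the paths of leaves that have fewer than min_examples examples."""
    return [path for path, count in iter_leaves(taxonomy) if count < min_examples]


gaps = find_gaps(TAXONOMY)
for path in gaps:
    print("/".join(path))
```

In this toy setup, the leaves with no examples are the gaps a developer would target with new synthetic data.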
A second LLM, referred to as a teacher model, formulates instructions based on a question-answer framework tailored to the task. The teacher model aims to further refine the simulation by generating instructions for each category while maintaining quality control. This graduated training approach enables the AI model to progressively build upon its existing knowledge base, similar to human learning progression.
“Instruction data is the lever for building a chatbot that behaves the way you want it to,” said Akash Srivastava, chief architect of LLM alignment at IBM Research. “Our method allows you to write a recipe for the problems you want your chatbot to solve and to generate instruction data to build that chatbot.”
One of the key benefits of using synthetic data is the added privacy. A model trained on real data carries the inherent risk of spitting that exact personal data back out if prompted in a specific way. With synthetic data, you can mirror real human behaviors, interactions, and choices, without violating user privacy.
Synthetic data for AI models offers several benefits, but it comes with its own set of risks. You want the synthetic data to closely mimic human behavior, but if it mimics a real user’s data too closely, it could become a privacy problem, especially in industries like healthcare and finance.
To test the LAB method, IBM Research generated a synthetic dataset with 1.2 million instructions and used that data to train two open-source LLMs. The results show that both LLMs performed on par with or better than state-of-the-art chatbots on a wide range of benchmarks. IBM also used the synthetic data to improve its own enterprise-focused Granite models on IBM watsonx.
According to IBM, two distinguishing traits contributed to these impressive results. Firstly, the teacher model can generate synthetic examples from each leaf node of the taxonomy, allowing for broader coverage of target tasks.
Secondly, the LAB method allows new skills and knowledge to be added to the base LLM without having to incorporate this information into the teacher model as well. “This means you don’t need some all-powerful teacher model that distills its capabilities into the base model,” said David Cox, vice president for AI models at IBM Research.
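The first of these traits, generating examples from every leaf node, can be sketched as follows. This is a simplified sketch under stated assumptions: the taxonomy contents, the seed questions stored at each leaf, the `stub_teacher` function, and the `per_leaf` parameter are all hypothetical. The stub returns templated placeholders where a real pipeline would call a teacher LLM.

```python
# Hypothetical taxonomy: each leaf holds a short list of seed questions.
# Only the three top-level category names come from the article.
taxonomy = {
    "knowledge": {"finance": {"tax_law": ["What is a tax deduction?"]}},
    "foundational_skills": {"math": {"arithmetic": ["What is 7 + 5?"]}},
    "compositional_skills": {"writing": {"email_summaries": ["Summarize this email."]}},
}


def leaf_nodes(node, path=()):
    """Yield (path, seeds) for every leaf of the nested taxonomy."""
    for name, child in node.items():
        if isinstance(child, dict):
            yield from leaf_nodes(child, path + (name,))
        else:
            yield path + (name,), child


def stub_teacher(path, seeds, n):
    """Stand-in for a teacher LLM: returns n placeholder question-answer pairs."""
    task = "/".join(path)
    return [
        {
            "task": task,
            "question": f"[synthetic question {i + 1} modeled on: {seeds[0]}]",
            "answer": "[teacher answer placeholder]",
        }
        for i in range(n)
    ]


def generate_dataset(tax, teacher, per_leaf=2):
    """Ask the teacher for per_leaf examples at every leaf, so no task is skipped."""
    dataset = []
    for path, seeds in leaf_nodes(tax):
        dataset.extend(teacher(path, seeds, per_leaf))
    return dataset


dataset = generate_dataset(taxonomy, stub_teacher)
print(len(dataset), "synthetic examples across",
      len({row["task"] for row in dataset}), "leaf tasks")
```

Because generation is driven by the tree and not by whatever the teacher happens to produce, each leaf task receives the same number of examples in this sketch.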
IBM’s patent also highlights that there could be a rise in demand for AI services, and it could be just as lucrative as building AI itself. It won’t be surprising if IBM uses this patent to support enterprises that are building their own AI models, offering a less resource-intensive method compared to collecting authentic user data.