Synthetic data generation company Gretel last week announced it has donated more than 100,000 examples of text-to-SQL conversions and parked them on Hugging Face, providing enterprises with another free, open source resource for building generative AI applications.
Analytics departments in businesses speak Structured Query Language, but the GenAI revolution is occurring with unstructured data (predominantly text but also images), and bridging the gap between natural language and SQL is not always easy.
Enterprises have reams of pertinent data stashed away in millions of tables sitting in data warehouses, but getting access to this information requires the appropriate SQL query, and converting natural language into SQL as part of a GenAI application isn’t straightforward.
For instance, a manager seeking more detail on sales might ask “What was the total revenue generated from credit card transactions in the last quarter, broken down by product category?” That may sound simple enough, but there could be several ways to convert that question into a SQL query, some of which are correct and some that aren’t.
That’s the basic impetus behind the decision by Gretel, a five-year-old San Diego company specializing in tools for creating synthetic data, to open source a synthetic dataset of more than 100,000 examples of text-to-SQL conversions.
Alex Watson, co-founder and chief product officer at Gretel, says the dataset will help companies use GenAI to derive insights from complex databases, data warehouses, and data lakes, without needing to learn SQL or rely on technical teams.
“Access to quality training data is one of the biggest obstacles to building with generative AI,” Watson says in a press release. “By providing developers with high-quality, synthetic text-to-SQL data, we’re enabling them to create AI models that can understand natural language queries and generate SQL queries.”
The text-to-SQL samples include metadata and span over 100 verticals, making them useful for companies in all sorts of industries for training Large Language Models (LLMs). They are available on Hugging Face under a permissive Apache 2.0 license. Users can also work with them within Gretel Navigator, the company’s enterprise offering for creating and managing synthetic data content.
For example, for the natural language query, “What are the names and prices of electronic products under $500, sorted from highest to lowest price?” the open source dataset includes the following SQL query:
SELECT product_name, price
FROM products
WHERE category = 'Electronics' AND price < 500
ORDER BY price DESC;
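To see what a query like this returns, here is a minimal Python sketch that runs it against an in-memory SQLite database. The table layout and the four product rows are invented for illustration and are not taken from Gretel's dataset. The string literal uses straight single quotes, which SQL requires.

```python
import sqlite3

# Hypothetical schema and rows, made up purely for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (product_name TEXT, category TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("Headphones", "Electronics", 199.0),
        ("Laptop", "Electronics", 1200.0),
        ("Tablet", "Electronics", 450.0),
        ("Desk", "Furniture", 300.0),
    ],
)

# The SQL from the example above, with straight quotes around the literal.
query = """
SELECT product_name, price
FROM products
WHERE category = 'Electronics' AND price < 500
ORDER BY price DESC;
"""

rows = conn.execute(query).fetchall()
for name, price in rows:
    print(name, price)
```

With these sample rows, only the two electronics items priced under $500 come back, most expensive first. The laptop is filtered out by price and the desk by category.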
“A data scientist can use these text-to-SQL samples to train or fine-tune AI models,” says Gretel Chief Scientist Yev Meyer. “By feeding the model with paired examples of natural language queries and corresponding SQL code, the model learns to map between the two and generalize and generate SQL code for queries that the model has not even seen yet.”
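As a rough sketch of what "paired examples" can look like in practice, the Python snippet below turns one question-and-SQL pair into a prompt/completion record of the kind many fine-tuning pipelines accept. The field names (question, schema, sql), the prompt wording, and the schema string are assumptions made for illustration. They are not the column names of Gretel's dataset or the format of any particular training tool.

```python
import json

def to_training_record(example: dict) -> dict:
    """Build one prompt/completion pair from a text-to-SQL example.

    The keys 'question', 'schema', and 'sql' are hypothetical names
    chosen for this sketch.
    """
    prompt = (
        "Translate the question into a SQL query.\n"
        f"Schema: {example['schema']}\n"
        f"Question: {example['question']}\n"
        "SQL:"
    )
    # Leading space so the completion follows "SQL:" naturally.
    return {"prompt": prompt, "completion": " " + example["sql"]}

# One pair, based on the example quoted in the article.
examples = [
    {
        "question": (
            "What are the names and prices of electronic products "
            "under $500, sorted from highest to lowest price?"
        ),
        "schema": "products(product_name, category, price)",
        "sql": (
            "SELECT product_name, price FROM products "
            "WHERE category = 'Electronics' AND price < 500 "
            "ORDER BY price DESC;"
        ),
    }
]

records = [to_training_record(e) for e in examples]

# JSON Lines: one record per line, a common input format for fine-tuning.
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```

Including the schema in the prompt is a design choice rather than a requirement. It gives the model the table and column names it needs to produce a query for a database it has not seen before.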
Gretel isn’t the first outfit to share a large collection of text-to-SQL samples. The company points out that Yale University’s Language, Information, and Learning at Yale (LILY) Lab created the Spider dataset, which comprises 7,000 text-to-SQL examples across a variety of domains.
However, Spider required 11 university students to work a total of 1,000 hours to complete, “an incredible amount of effort for a relatively small dataset in the context of large language models,” Meyer says. (LILY says to keep an eye out for Spider 2.0, which is due soon and will provide text-to-SQL for the LLM age.)
The Spider dataset’s copyleft license also poses challenges to wider adoption, which is one reason Gretel chose the permissive Apache 2.0 license for its data set.
“Our dataset is the largest and most diverse open source dataset of its kind,” Meyer says. “Other open source text-to-SQL datasets are much smaller (reducing their utility) or their licensing comes with strings attached. Releasing this massive dataset under the Apache 2.0 license gives AI developers the freedom to build whatever they want with it. We’re excited to see where it goes!”
To access Gretel’s text-to-SQL dataset on Hugging Face, click here. To read Meyer’s blog post about the text-to-SQL dataset, click here.