Mastering Text-to-SQL with OpenAI GPT-4 Mini for Data Analysis

Project Goal and Key Conclusion

This project demonstrates that fine-tuning open-source large language models (LLMs) like Meta’s Llama 3.1 8B Instruct and Alibaba’s Qwen 2.5 series can produce expert-level text-to-SQL performance tailored to a specific database schema. After more than 60 training sessions totaling over 1,600 hours on a high-end RTX 4090 system, the author achieved SQL query generation quality comparable to paid services such as Grok or Perplexity, within an open-source, privacy-focused environment. The project confirms that with careful dataset curation, advanced reinforcement learning techniques, and efficient fine-tuning methods, open-source models can handle complex SQL tasks involving temporal analysis, self-joins, and running totals.

Fine-Tuning Models for Text-to-SQL Tasks

The core of this project was fine-tuning LLMs to translate natural language questions into SQLite-compatible SQL queries. The author started with Meta’s Llama 3.1 8B Instruct model, known for its instruction-following capabilities, then experimented with Alibaba’s Qwen 2.5 variants, including Qwen2.5-Coder-7B-Instruct, which is optimized for code generation. By switching between these models, the project evaluated their relative strengths on complex SQL generation. Fine-tuning emphasized difficult operations such as self-joins, temporal calculations using SQLite functions like julianday, and detection of recurring patterns with window functions like RANK() OVER. The final goal was a model that could rival commercial AI services while remaining open source, which is critical for private or closed ecosystems where data security is paramount.
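
To make those operations concrete, here is a minimal, self-contained sketch of the query pattern in question, run against a hypothetical `calls` table rather than the project’s actual schema: a self-join combined with julianday() to measure the gap between a caller’s consecutive calls.

```python
import sqlite3

# Hypothetical toy table standing in for the kind of schema being queried.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE calls (id INTEGER PRIMARY KEY, caller_id INTEGER, call_date TEXT);
INSERT INTO calls VALUES
  (1, 101, '2024-01-05'), (2, 101, '2024-01-12'),
  (3, 202, '2024-02-01'), (4, 101, '2024-03-20');
""")

# Self-join plus julianday(): days elapsed between consecutive calls
# from the same caller -- one of the "difficult operations" named above.
query = """
SELECT a.caller_id,
       a.call_date AS prev_call,
       b.call_date AS next_call,
       julianday(b.call_date) - julianday(a.call_date) AS days_between
FROM calls a
JOIN calls b
  ON b.caller_id = a.caller_id
 AND b.call_date = (SELECT MIN(c.call_date) FROM calls c
                    WHERE c.caller_id = a.caller_id
                      AND c.call_date > a.call_date);
"""
for row in con.execute(query):
    print(row)  # e.g. (101, '2024-01-05', '2024-01-12', 7.0)
```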

Advanced Methods Including Group Relative Policy Optimization

A standout method in this project was Group Relative Policy Optimization (GRPO), the reinforcement learning approach implemented by trl’s GRPOTrainer, which refines model outputs based on multiple reward signals. Unlike traditional supervised fine-tuning, GRPO encourages exploration while balancing adherence to a reference policy through a KL divergence penalty. Key reward functions included:

- Format Reward: ensures the output follows a structured reasoning-and-SQL tagging format, scoring between 0 and 1.
- SQL Correctness Reward: compares execution results of generated SQL against ground truth, using the sqlglot parser to check syntactic and semantic accuracy.
- Complexity Reward: matches query complexity (token length and SQL operations) to the gold standard.
- Reasoning Quality Reward: assesses clarity and logical structure in the reasoning output.

Hyperparameters were carefully tuned: learning rates ranged from 1e-6 to 4e-5, KL penalties (beta) from 0.01 to 0.1, and epochs from 3 to 10 across experiments, though the author ultimately recommends at least 12 epochs for convergence. Training leveraged trl for GRPOTrainer, unsloth for GPU optimization, peft for LoRA-based low-rank adaptation, and bitsandbytes for 4-bit/8-bit quantization to fit models into 14-20 GB of VRAM.
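
As a concrete illustration, here is a minimal sketch of what two of these reward signals could look like, written in the callable style that trl’s GRPOTrainer accepts (one score per completion). The tag format, function names, and partial-credit values are assumptions for illustration, not the author’s exact implementation.

```python
import re
import sqlite3
import sqlglot

# Assumed output format: <reasoning>...</reasoning><sql>...</sql>
TAG_PATTERN = re.compile(r"<reasoning>.*?</reasoning>\s*<sql>(.*?)</sql>", re.DOTALL)

def format_reward(completions, **kwargs):
    """Format Reward: 1.0 if the completion follows the tagging format, else 0.0."""
    return [1.0 if TAG_PATTERN.search(c) else 0.0 for c in completions]

def sql_correctness_reward(completions, gold_sql,
                           db_path="wandsworth_callcenter_sampled.db", **kwargs):
    """SQL Correctness Reward: parse with sqlglot, then compare execution
    results of the generated query against the gold query."""
    rewards = []
    con = sqlite3.connect(db_path)
    for completion, gold in zip(completions, gold_sql):
        match = TAG_PATTERN.search(completion)
        if match is None:
            rewards.append(0.0)
            continue
        candidate = match.group(1).strip()
        try:
            sqlglot.parse_one(candidate, read="sqlite")   # syntactic check
            got = con.execute(candidate).fetchall()
            expected = con.execute(gold).fetchall()
            # Full credit for matching results, partial credit for executing at all.
            rewards.append(1.0 if sorted(got) == sorted(expected) else 0.25)
        except Exception:
            rewards.append(0.0)
    con.close()
    return rewards
```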

Efficient Fine-Tuning Using LoRA Adapters

To keep training times and resource usage manageable, the project employed LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique. Instead of updating all model weights, LoRA adds low-rank adapters with ranks between 8 and 32 to the attention modules, training only about 20 million parameters rather than billions. This approach shortened training runs to between 2 and 72 hours and kept GPU memory usage under 15 GB for most sessions on the RTX 4090. This efficiency made it feasible to iterate quickly over multiple experiments without access to massive compute clusters, a significant advantage for applied research in text-to-SQL tasks.
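
For reference, a minimal sketch of this kind of LoRA-plus-quantization setup with peft and bitsandbytes follows. The specific rank, alpha, and target modules are plausible values consistent with the ranges reported above, not the author’s exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit to fit comfortably within 24 GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach low-rank adapters to the attention modules; only these are trained.
lora_config = LoraConfig(
    r=16,                      # within the 8-32 range reported above
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # on the order of tens of millions
```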

Dataset Construction Focused on Complexity and Realism

Dataset quality and relevance were essential to the project’s success. The author initially used the b-mc2/sql-create-context dataset from Hugging Face, starting with 300-500 examples for rapid prototyping. Later, the full dataset (~10,000 rows) was validated down to 5,020 examples, eventually narrowing to 616 high-complexity queries rated at complexity level 3. These difficult queries involved advanced SQL concepts like category counting with running totals (quarterly aggregates using strftime), temporal sequence analysis (date differences using julianday), and recurrence pattern detection (RANK() OVER).
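
To illustrate the first of those patterns, the sketch below computes quarterly category counts with strftime() plus a running total via a window function, against a hypothetical table rather than the project’s actual schema.

```python
import sqlite3

# Hypothetical `requests` table; SQLite >= 3.25 is assumed for window functions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE requests (id INTEGER PRIMARY KEY, category TEXT, opened TEXT);
INSERT INTO requests VALUES
  (1, 'noise',   '2024-01-10'), (2, 'noise',   '2024-02-02'),
  (3, 'parking', '2024-04-15'), (4, 'noise',   '2024-05-21'),
  (5, 'parking', '2024-08-03');
""")

# Quarterly aggregates via strftime(), then a per-category running total.
query = """
SELECT category,
       strftime('%Y', opened) || '-Q' ||
         ((CAST(strftime('%m', opened) AS INTEGER) + 2) / 3) AS quarter,
       COUNT(*) AS n,
       SUM(COUNT(*)) OVER (PARTITION BY category
                           ORDER BY MIN(opened)) AS running_total
FROM requests
GROUP BY category, quarter
ORDER BY category, quarter;
"""
for row in con.execute(query):
    print(row)  # e.g. ('noise', '2024-Q1', 2, 2) then ('noise', '2024-Q2', 1, 3)
```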

Prompts included schema context, natural language questions, gold SQL queries, and complexity labels to guide training. The evaluation set consisted of 10 carefully designed test queries against a synthetic call center database (“wandsworth_callcenter_sampled.db”), split evenly between 5 easy/medium and 5 hard queries. This set tested syntactic validity (executable queries), semantic correctness (matching ground truth results), and AI-assisted semantic equivalence via Grok’s judgments, providing a comprehensive performance snapshot.
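
For illustration, a training record and prompt in the shape described above might look like the following; the field names and tag instructions are assumptions about the format, not the dataset’s actual schema.

```python
# Hypothetical record: schema context, question, gold SQL, complexity label.
example = {
    "schema": "CREATE TABLE calls (id INTEGER, caller_id INTEGER, "
              "call_date TEXT, category TEXT);",
    "question": "How many calls were logged in each category per month of 2024?",
    "gold_sql": (
        "SELECT category, strftime('%Y-%m', call_date) AS month, COUNT(*) "
        "FROM calls WHERE strftime('%Y', call_date) = '2024' "
        "GROUP BY category, month;"
    ),
    "complexity": 2,
}

# Prompt assembly: schema context first, then the question and output format.
prompt = (
    f"Database schema:\n{example['schema']}\n\n"
    f"Question: {example['question']}\n\n"
    "Answer with <reasoning>...</reasoning> followed by <sql>...</sql>."
)
print(prompt)
```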

Evaluation Metrics and Benchmarking Model Performance

The project used multiple quantitative metrics to evaluate model outputs rigorously:

- Syntactic Validity Score (SVS): measures whether the generated SQL runs without errors in SQLite.
- Ground Truth Semantic Correctness Score (GTSCS): checks whether the query results match the official gold-standard output.
- AI Semantic Correctness Score (AISCS): uses AI judgment (via Grok) to assess semantic equivalence even when the SQL structure differs.
- Composite Precision Score (CPS): the average of SVS, GTSCS, and AISCS, capturing overall accuracy.

These metrics were applied to the 10-query evaluation set, balancing easy and hard problems and ensuring the models were tested on realistic, domain-specific SQL challenges rather than synthetic or overly simplified tasks.
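
A minimal sketch of how these four scores might be computed over the 10-query set follows; the AISCS values are supplied externally here, since the Grok judgment step cannot be reproduced locally.

```python
import sqlite3

def evaluate(pairs, db_path="wandsworth_callcenter_sampled.db", ai_judgments=None):
    """pairs: list of (generated_sql, gold_sql) tuples;
    ai_judgments: 0/1 per query from the external AI-equivalence check."""
    con = sqlite3.connect(db_path)
    svs, gtscs = [], []
    for generated, gold in pairs:
        try:
            got = con.execute(generated).fetchall()
            svs.append(1.0)  # SVS: ran without error in SQLite
            expected = con.execute(gold).fetchall()
            gtscs.append(1.0 if sorted(got) == sorted(expected) else 0.0)  # GTSCS
        except Exception:
            svs.append(0.0)
            gtscs.append(0.0)
    con.close()
    aiscs = ai_judgments if ai_judgments is not None else [0.0] * len(pairs)
    n = len(pairs)
    scores = {"SVS": sum(svs) / n, "GTSCS": sum(gtscs) / n, "AISCS": sum(aiscs) / n}
    scores["CPS"] = (scores["SVS"] + scores["GTSCS"] + scores["AISCS"]) / 3
    return scores
```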

Hardware Setup and Software Environment Details

Training was conducted on a high-end gaming workstation equipped with an NVIDIA RTX 4090 GPU with 24 GB of VRAM, running Windows 11 with Ubuntu 22.04 under the Windows Subsystem for Linux 2 (WSL2). CUDA 12.1 and PyTorch 2.2.0 provided GPU acceleration. The software stack included transformers 4.43.0, datasets 2.20.0, sqlglot 25.1.0, and other dependencies listed in the project’s requirements.txt. Some challenges, like CUDA version mismatches and library incompatibilities, were resolved by downgrading specific packages (e.g., trl to 0.8.6) to ensure smooth integration. Quantization libraries such as bitsandbytes allowed fitting large models into the available GPU memory without sacrificing much accuracy, enabling sequence lengths of up to 4,096 tokens to accommodate complex schema contexts.
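
For reference, the pinned versions named above would translate into a requirements.txt excerpt along these lines; this is a reconstruction from the versions mentioned in the article, not the project’s actual file, and packages whose versions are not stated are left unpinned.

```text
torch==2.2.0
transformers==4.43.0
datasets==2.20.0
sqlglot==25.1.0
trl==0.8.6
peft
bitsandbytes
unsloth
```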

Real-World Motivation Behind the Project

This project was motivated by practical considerations rather than purely academic curiosity. Many companies operate proprietary databases and require domain-specific AI tools that respect data privacy. Commercial text-to-SQL services often involve sending sensitive queries to external servers, which is unacceptable in closed ecosystems. By fine-tuning open-source models on a single database schema, the author sought to create a domain-expert LLM that can run locally or within secure environments. This approach offers customization, privacy, and cost advantages over subscription-based AI services. Furthermore, the project addresses questions of feasibility: How many training hours are needed? Can open-source models match or exceed paid alternatives? The detailed methodology and the upcoming articles promise answers grounded in empirical data.

Summary and Next Steps for Interested Readers

In summary, this project shows that open-source LLMs like Llama 3.1 8B Instruct and Qwen 2.5 can be successfully fine-tuned using advanced reinforcement learning (GRPO) and efficient parameter tuning (LoRA) to generate complex, accurate SQL queries from natural language. The rigorous evaluation framework and curated datasets ensure that the results are meaningful and applicable to real-world database querying needs. For readers interested in the detailed hardware setup and final quantitative results, the author plans two follow-up articles: one on machine configuration and one presenting the comprehensive outcomes and key takeaways. This series offers a valuable roadmap for practitioners seeking to build privacy-conscious, expert-level text-to-SQL models tailored to their own databases. As AI regulation and enterprise adoption continue to evolve, open-source, locally deployable solutions grow increasingly relevant for privacy and data-sovereignty concerns, and this project aligns with the broader trend toward self-hosted AI in sensitive environments.
