Fine-tuning Instruct Models with Raw Text Data
Economically Efficient Chatbot Training with Minimal Dialogue Data for Under $10
Summary: This experiment explores a cost-effective and user-friendly approach to fine-tuning Mistral's 7B Instruct v0.2 model on a dataset of 1.6 million tokens from The Guardian's manage-frontend repository. The method is particularly suitable for software developers without deep learning engineering experience.
1. Objective and Method
1.1 Objective:
The experiment aims to find a balance between the complexity of fine-tuning models with billions of tokens and a 128K context window. The goal is to achieve domain adaptation economically while maintaining model performance.
This method is highlighted for its cost-effectiveness and ease of use, especially valuable for developers with limited resources who want to enhance their chatbot's performance without deep learning engineering experience.
1.2 Method:
We fine-tuned Mistral's 7B Instruct v0.2 model on The Guardian's manage-frontend repository, approximately 1.6 million tokens.
We emphasized repeatable guidelines for cost-effective model fine-tuning using readily available hardware, aiming to minimize trial and error and maximize efficiency using raw text data instead of labeled dialogue data.
2. Training Resources and Libraries
2.1 Resources:
Nvidia A100 40GB and H100 80GB were used for training to ensure efficiency and reliability.
2.2 Libraries:
The Unsloth library was used for speed and memory efficiency during training. The SFTtrainer from the trl library served as a wrapper for the HuggingFace trainer to prepare the dataset for self-supervised training.
3. Dataset Creation
3.1 Composition:
The original dataset included the repository's wiki, a snapshot of the main branch, and the last 100 pull request comments and code changes to ensure diversity and richness.
Unlike existing fine-tuning methods that rely on a large amount of labeled dialogue data, our approach focuses more on the use of raw text data, reducing reliance on labeled dialogue data. This provides a new perspective on how to fine-tune effectively with limited data.
3.2 Data Scraping:
Wiki data was scraped by copying and pasting each wiki page into text files, and Python scripts were written to run locally, scraping the code repository and writing all files into text files.
3.3 Synthetic Dialogue Data:
Labeled dialogue data was generated using the GPT-4 Turbo API and specific instructions, which, although synthetic, helped mitigate catastrophic forgetting and performance degradation. This offers a viable strategy for enhancing model memory and performance.
4. Model Training and Hyperparameters
4.1 Hyperparameter Selection:
Optimal hyperparameters affecting model performance, such as LoRA levels, batch size, and learning rate, were determined through manual search. A learning rate of 2e-5 was found to be optimal, which seems to be the standard for fine-tuning Mistral.
4.2 Optimization Process:
After finding the optimal hyperparameters, training was rerun to include all data, a common practice.
5. Results and Analysis
5.1 Performance Improvement:
The results exceeded expectations, with the fine-tuned model accurately answering complex questions related to the codebase and incorporating patterns from the `manage-frontend` repository into responses to questions involving JavaScript and Typescript. The actual performance of the fine-tuned model, demonstrated through a Gradio app, surpassed expectations, validating the effectiveness of the fine-tuning method and providing other developers with a way to visualize and experimentally verify the fine-tuning effects.
5.2 Catastrophic Forgetting:
Performance on non-code-related questions, such as incorrect speed units, declined for the fine-tuned model. Catastrophic forgetting was lighter than expected, but a clear difference still existed between the fine-tuned model and the base model.
6. Text Generation Strategy
6.1 Deterministic Approach:
A deterministic approach was used in text generation, selecting the most likely next word or sequence of words using greedy search. This strategy ensured the quality of generated text while reducing uncertainty in the generation process.
7. Why These Hyperparameters Performed Best
7.1 Batch Size and Variability:
A lower batch size introduced more variability and noise into the gradient estimates, allowing the optimizer to respond more dynamically to specific features of data points.
7.2 LoRA Levels and Learning Capacity:
Higher LoRA levels increased the model's ability to learn task-specific details, while lower levels led to more forgetting. Higher LoRA levels provided more trainable parameters, allowing the model to learn details of new data more "intelligently," while a level of 2048 allowed the model to deviate too much from its valuable pre-trained knowledge.
Integrating these core points, the experimental report comprehensively presents a cost-effective and user-friendly fine-tuning method and the potential of using limited raw text data and synthetically generated dialogue data to enhance the performance of chatbots. Additionally, the results and practical application sections underscore the effectiveness of the fine-tuning method, while the text generation strategy section highlights the advantages of using a deterministic approach.
#TheGuardian #Chatbot #ModelFineTuning #DeepLearning #SoftwareDevelopment