This application is built with Streamlit and helps users create a properly formatted JSONL file, the file format required to fine-tune an OpenAI chat model. This application is meant to be used locally; please see our other version for remote use.
```python
data = {
    "messages": [
        {"role": "system", "content": system_message},
        {"role": "user", "content": prompt_text},
        {"role": "assistant", "content": ideal_generated_text}
    ]
}
```
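Each accepted record is appended to the output file as a single JSON line. A minimal sketch using only the standard library (the helper name `append_record` is illustrative, not the app's actual function):

```python
import json

def append_record(path, system_message, prompt_text, ideal_generated_text):
    """Append one chat-format training example as a single JSONL line."""
    data = {
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": prompt_text},
            {"role": "assistant", "content": ideal_generated_text},
        ]
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(data) + "\n")
```

Appending (mode `"a"`) rather than overwriting lets repeated "Accept Inputs" presses build up the training set one example per line.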
- User Prompts: Enter a question under the label "Enter your question? Human:".
- AI Response: Provide your ideal AI-generated response.
- Custom System Message: Add a custom system message or stick with the default message "You are a helpful and friendly assistant.".
- Data Saving: Upon pressing the "Accept Inputs" button, the provided data gets formatted and appended to an `output.jsonl` file.
- TRAINING_FILE_ID Input: Users can input their TRAINING_FILE_ID required for fine-tuning.
- Fine-Tuning: A button to send the `output.jsonl` file to OpenAI for fine-tuning.
- Chat Window: Test the fine-tuned model by sending messages and viewing the model's response.
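The fine-tuning button corresponds roughly to two API calls: uploading `output.jsonl` to OpenAI's files endpoint, then starting a fine-tuning job that references the returned file ID (or a TRAINING_FILE_ID entered manually). A hedged sketch using `requests` (already in the requirements); the endpoint paths reflect the public OpenAI REST API at the time of writing, and the function names and default model are illustrative:

```python
import requests

API_BASE = "https://api.openai.com/v1"

def build_headers(api_key):
    """Bearer-token auth header used by all OpenAI API calls."""
    return {"Authorization": f"Bearer {api_key}"}

def upload_training_file(api_key, path="output.jsonl"):
    # POST /v1/files with purpose="fine-tune"; the returned "id" is the
    # TRAINING_FILE_ID referenced in the UI.
    with open(path, "rb") as f:
        resp = requests.post(
            f"{API_BASE}/files",
            headers=build_headers(api_key),
            files={"file": f},
            data={"purpose": "fine-tune"},
        )
    resp.raise_for_status()
    return resp.json()["id"]

def start_fine_tune(api_key, training_file_id, model="gpt-3.5-turbo"):
    # POST /v1/fine_tuning/jobs starts the fine-tuning job.
    resp = requests.post(
        f"{API_BASE}/fine_tuning/jobs",
        headers=build_headers(api_key),
        json={"training_file": training_file_id, "model": model},
    )
    resp.raise_for_status()
    return resp.json()
```

Splitting upload and job creation into two functions mirrors the app's flow, where a user may paste an existing TRAINING_FILE_ID instead of re-uploading the file.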
- Clone the repository: `git clone https://github.com/raymondbernard/finetuneopenai.git`
- Navigate to the repository directory: `cd finetuneopenai`
- Install the required packages: `pip install -r requirements.txt`
- Run the Streamlit app: `streamlit run app.py`
- streamlit
- jsonlines
- tiktoken
- numpy
- requests
- python-dotenv
- streamlit_extras
Feel free to raise issues, provide feedback, or make contributions to improve the application.
This project is licensed under the MIT License. See `LICENSE` for more information.
`openaicheck.py` is a script designed to inspect and validate the structure of a dataset for chat completions. It performs the following operations:
- Data Inspection: The script initially loads the dataset from `output.jsonl` and prints the number of examples and the first example to provide an overview.
- Format Error Checks: The script checks for various formatting issues such as:
- Incorrect data types
- Missing message lists
- Unrecognized message keys
- Missing content
- Unrecognized roles in messages
- Absence of an assistant's message
- Token Count: It calculates the number of tokens for each message and provides distribution statistics such as:
- Range (Min and Max)
- Average (Mean)
- Middle Value (Median)
- 5th Percentile
- 95th Percentile
- Number of Messages per Example Distribution: Provides statistics about the number of messages in each example.
- Total Tokens per Example Distribution: Indicates the total number of tokens in each example.
- Assistant Tokens per Example Distribution: Pertains to the number of tokens in the assistant's messages within each example.

For each distribution, the following statistics are provided:
- Range: The smallest and largest values.
- Average (Mean): The average value.
- Middle Value (Median): The middle value when sorted.
- 5th Percentile: 5% of the data lies below this value.
- 95th Percentile: 95% of the data lies below this value.
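The checks and distribution statistics above can be sketched as follows. This is an illustrative reimplementation, not the script itself; token counts use a simple whitespace split as a stand-in for tiktoken so the sketch stays dependency-light:

```python
import numpy as np

VALID_ROLES = {"system", "user", "assistant"}
VALID_KEYS = {"role", "content", "name"}

def check_format(example):
    """Return a list of format-error labels for one example (empty = OK)."""
    if not isinstance(example, dict):
        return ["data_type"]
    messages = example.get("messages")
    if not isinstance(messages, list):
        return ["missing_messages_list"]
    errors = []
    for message in messages:
        if not set(message.keys()) <= VALID_KEYS:
            errors.append("unrecognized_key")
        if message.get("role") not in VALID_ROLES:
            errors.append("unrecognized_role")
        if not message.get("content"):
            errors.append("missing_content")
    if not any(m.get("role") == "assistant" for m in messages):
        errors.append("missing_assistant_message")
    return errors

def num_tokens(text):
    # Stand-in for tiktoken: count whitespace-separated tokens.
    return len(text.split())

def print_distribution(values, name):
    """Print range, mean, median, and 5th/95th percentiles for a list."""
    arr = np.array(values)
    print(f"{name}: min={arr.min()}, max={arr.max()}, "
          f"mean={arr.mean():.1f}, median={np.median(arr)}, "
          f"p5={np.percentile(arr, 5)}, p95={np.percentile(arr, 95)}")
```

In the real script, `num_tokens` would encode each message with a tiktoken encoding for the target model, and `print_distribution` would be called once per distribution (messages per example, total tokens, assistant tokens).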
OpenAI Blog: GPT-3.5 Turbo, Fine-Tuning, and API Updates
MIT License
(c) 2023 Ray Bernard
This is a simple guide to getting started with fine-tuning an OpenAI chat model. Please do not hesitate to open an issue if you encounter a problem or have a suggestion.