synthetic_data_generator

A local synthetic data generator that produces synthetic training data using an LLM.

Overview

This script is designed to convert bodies of text into a question and answer JSON format using the GPT-4 language model. The process involves extracting text from PDF files, tokenizing the text, generating questions and answers, and then saving the results in a JSON file.

Prerequisites

Required Python packages: langchain, PyPDF2, transformers, requests, tqdm. pathlib is also used, but it is part of the Python 3 standard library and does not need to be installed separately.

Setup

1. Clone this repository to your local machine.
2. Install the required packages by running: pip install -r requirements.txt
3. Obtain an API token from Hugging Face Hub and set it as an environment variable:

export HUGGINGFACEHUB_API_TOKEN='your_api_token'
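Inside a Python script, the token can then be read back from the environment. The following is a minimal sketch, not the repository's actual code: the helper name get_hf_token is hypothetical and is used here only to show failing early with a readable message when the variable is missing.

```python
import os


def get_hf_token(var_name: str = "HUGGINGFACEHUB_API_TOKEN") -> str:
    """Return the API token from the environment, or fail with a clear message.

    get_hf_token is a hypothetical helper, not a function from this repository.
    """
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"{var_name} is not set; export it before running the script."
        )
    return token
```

Checking for the token at startup avoids running the PDF extraction first and only then discovering that the model call cannot be authenticated.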

Usage

1. Place the PDF files you want to process in the folder specified by folder_path.
2. Run the script: python convert_text_to_qa.py

The script will perform the following steps:

1. Extract text from PDF files.
2. Tokenize the extracted text.
3. Generate questions and answers using the GPT-4 language model.
4. Save the generated Q&A pairs in a JSON file named responses.json.
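The extraction and saving steps can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual implementation: the function names list_pdfs, extract_text, and save_responses are hypothetical, and the shape of each Q&A record is left to the caller. Text extraction uses PdfReader from PyPDF2 (version 2.0 or later), imported lazily so the other helpers work without it.

```python
import json
from pathlib import Path


def list_pdfs(folder_path: str) -> list[Path]:
    """Return the .pdf files directly inside folder_path, sorted by name."""
    return sorted(Path(folder_path).glob("*.pdf"))


def extract_text(pdf_path: Path) -> str:
    """Concatenate the text of every page of one PDF."""
    from PyPDF2 import PdfReader  # lazy import; assumes PyPDF2 >= 2.0

    reader = PdfReader(str(pdf_path))
    # Pages without extractable text (e.g. scanned images) contribute "".
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def save_responses(qa_pairs: list[dict], out_path: str = "responses.json") -> None:
    """Write the generated Q&A pairs to a JSON file."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(qa_pairs, f, ensure_ascii=False, indent=2)
```

The tokenization and generation steps sit between extract_text and save_responses and depend on the model being used, so they are omitted from this sketch.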

Note

You can modify the model_path, folder_path, and other parameters in the script as needed. The script processes the text in chunks to manage memory usage. You can adjust the chunk size (256 in the example) based on your system's capabilities.
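As an illustration of the chunking idea, here is a minimal sketch that splits a token sequence into fixed-size pieces. The function name chunk_tokens is hypothetical, and the script itself may chunk differently; the default of 256 simply mirrors the value mentioned above.

```python
from typing import Iterator, Sequence


def chunk_tokens(tokens: Sequence, chunk_size: int = 256) -> Iterator[Sequence]:
    """Yield consecutive slices of at most chunk_size tokens.

    chunk_tokens is a hypothetical helper, not a function from this repository.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]


# Example: 600 tokens split into chunks of 256; the last chunk is shorter.
sizes = [len(chunk) for chunk in chunk_tokens(list(range(600)), 256)]
print(sizes)  # [256, 256, 88]
```

A smaller chunk size lowers peak memory per model call at the cost of more calls, and each chunk then carries less context for generating questions.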

License

Contact

If you have any questions or suggestions, feel free to contact me at [email protected]

