- Added self-correction support in BaseExperiment.
- Modularized the repository as a package for quick replication. Currently available for SyntheticCopa; experiment objects for the other synthetic SuperGLUE tasks will be added shortly.
This is the official repository of the paper: TarGEN: Targeted Data Generation with Large Language Models
- Step 1: Import packages & add API_KEYS in the `.env` file that should be created in the `<ROOT_DIRECTORY>`:
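For example, the `.env` file only needs to define the key that the snippet below reads; the value shown is a placeholder:

```
OPEN_AI_KEY=<your-openai-api-key>
```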
The ability to control model objects has also been introduced in the `main.py` script, abstracting away the internal classes to provide flexibility. Any langchain-supported API can be used for experimentation.
```python
import os

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

from TarGEN import Generate
from experiments.copa import SyntheticCopa, SyntheticCb

# <ROOT_DIRECTORY> is the path to the directory containing the .env file
load_dotenv(<ROOT_DIRECTORY>)
API_KEY = os.getenv("OPEN_AI_KEY")
TARGET_DATA_STYLE = "COPA"

# Load Model
openai_llm = ChatOpenAI(openai_api_key=API_KEY)
```
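Because any langchain chat model can be plugged in, swapping providers only changes how the model object is constructed. A minimal sketch, assuming the `langchain-anthropic` package is installed and an `ANTHROPIC_API_KEY` entry has been added to the `.env` file (both are assumptions, not part of this repository):

```python
import os

from langchain_anthropic import ChatAnthropic

# Any langchain chat model can replace ChatOpenAI; the model name is illustrative.
anthropic_llm = ChatAnthropic(
    model="claude-3-5-sonnet-20240620",
    api_key=os.getenv("ANTHROPIC_API_KEY"),
)
```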
- Step 2: In the `experiments` directory, add the configs for all the steps:
**Important:** Check out the sample `SyntheticCopa` class, which extends the `BaseExperiment` class as a systematic way to enforce the required methods.
- For added flexibility, this file can also override the `generator_function` in case the generation logic needs to change.
- This file requires defining the pydantic object class of the instance sample, where each field and its description is explicitly specified. A few global variables, such as `DOMAIN`, `N`, and `SENTENCE`, are available at runtime for the generator to access. These globals are currently frozen, and changing them requires modifying the `BaseExperiment` class; a future update will expose them as local runtime variables that the prompt engineer can configure. A minimal sketch of an experiment definition is shown below.
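A minimal sketch of what such an experiment definition can look like, assuming a `BaseExperiment` interface with a `get_config()` method; the field names, the `experiments.base` import path, the `sample_class` hook, and the prompt-config keys are illustrative assumptions rather than the repository's actual COPA schema:

```python
from pydantic import BaseModel, Field

from experiments.base import BaseExperiment  # hypothetical import path


class CopaSample(BaseModel):
    """Pydantic schema of one synthetic instance; every field carries a description."""
    premise: str = Field(description="A short premise sentence.")
    choice1: str = Field(description="First candidate cause/effect.")
    choice2: str = Field(description="Second candidate cause/effect.")
    label: int = Field(description="0 if choice1 is the correct continuation, 1 otherwise.")


class MySyntheticTask(BaseExperiment):
    sample_class = CopaSample  # hypothetical hook for the instance schema

    def get_config(self):
        # Return the step-wise prompt configs; globals such as DOMAIN, N and
        # SENTENCE are injected at runtime by BaseExperiment.
        return {
            "step_1_prompt": "Generate {N} seed sentences about {DOMAIN}.",
            "step_2_prompt": "Create a COPA-style instance from: {SENTENCE}",
        }
```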
- Step 3: Load the experiment object in `main.py`. Once the experiment object has been designed as detailed in Step 2, it has to be loaded at runtime:
```python
# Load orchestrator
if TARGET_DATA_STYLE == "COPA":
    experiment_object = SyntheticCopa(model=openai_llm)
else:
    # BaseExperiment must also be imported from its module in this repository
    experiment_object = BaseExperiment(llm=openai_llm)
```
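To register another experiment style, extend the same selection block. `SyntheticCb` is already imported above, but the `"CB"` style name and its constructor keyword are assumptions to verify against the class definition:

```python
if TARGET_DATA_STYLE == "COPA":
    experiment_object = SyntheticCopa(model=openai_llm)
elif TARGET_DATA_STYLE == "CB":  # hypothetical style name
    experiment_object = SyntheticCb(model=openai_llm)  # confirm whether the class expects `model=` or `llm=`
else:
    experiment_object = BaseExperiment(llm=openai_llm)
```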
- Step 4: The generator only requires the experiment object. The `create_synthetic_data()` method orchestrates the generation of samples based on the `get_config()` method defined in the experiment-specific class:
```python
targen = Generate(experiment_object=experiment_object)
targen.create_synthetic_data(output_path="outputs/copa_sample.json",
                             n_samples=8,
                             overwrite=True,
                             num_instance_seeds=1)
```
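The call writes the generated samples to `output_path`. A small sketch for inspecting them, assuming the file is a JSON list whose items follow the experiment's pydantic sample schema:

```python
import json

with open("outputs/copa_sample.json") as f:
    samples = json.load(f)

print(f"Generated {len(samples)} samples")
print(samples[0])  # one synthetic instance, shaped by the experiment's pydantic schema
```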
```bibtex
@article{gupta2023targen,
  title={TarGEN: Targeted Data Generation with Large Language Models},
  author={Gupta, Himanshu and Scaria, Kevin and Anantheswaran, Ujjwala and Verma, Shreyas and Parmar, Mihir and Sawant, Saurabh Arjun and Mishra, Swaroop and Baral, Chitta},
  journal={arXiv preprint arXiv:2310.17876},
  year={2023}
}
```