Add 100 Samples Per Regex / JSON Schema #1

Merged
lapp0 merged 9 commits into main from add-100-samples
Oct 21, 2024

lapp0 (Owner) commented Oct 11, 2024

Changes

  • Adds 100 samples for each schema / pattern to src/samples/
  • data.py: Replaces the example key with a samples key
  • Updates all src/benchmark_*.py ASV benchmark scripts to run 100 samples per benchmark
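
As a rough illustration of the `example` to `samples` change, the sketch below loads a list of samples from a JSON file and attaches it to a benchmark case. The helper names `load_samples` and `build_case`, the `regex` key, and the phone number pattern are assumptions made for this example, not the actual contents of data.py.

```python
import json
import tempfile
from pathlib import Path


def load_samples(samples_dir, name):
    """Return the list of sample strings stored in <samples_dir>/<name>.json."""
    with open(Path(samples_dir) / f"{name}.json") as f:
        return json.load(f)


def build_case(samples_dir, name, regex):
    # Hypothetical entry shape: a single `example` string becomes a `samples` list.
    return {"regex": regex, "samples": load_samples(samples_dir, name)}


# Demo with a throwaway directory standing in for src/samples/
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "phone_number.json").write_text(
        json.dumps(["555-123-4567", "212-555-0199"])
    )
    case = build_case(tmp, "phone_number", r"\d{3}-\d{3}-\d{4}")

# A benchmark body would then loop over every sample instead of one example
sample_lengths = [len(sample) for sample in case["samples"]]
```

Keeping the samples in one JSON file per pattern means a benchmark can iterate over all of them without any generation cost at benchmark time.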

Caveat

Once dottxt-ai/outlines#1204 is merged and the outlines version is bumped, we need to switch to RegexGuide.from_regex.

Sample Generation Scripts

phone_number.json

import random
import json

def generate_phone_number():
    # Generate 3 random digits, 3 random digits, and 4 random digits for the phone number
    area_code = f'{random.randint(100, 999)}'
    prefix = f'{random.randint(100, 999)}'
    line_number = f'{random.randint(1000, 9999)}'

    # Combine the parts into the format XXX-XXX-XXXX
    return f'{area_code}-{prefix}-{line_number}'

# Create a list of 100 phone numbers
phone_numbers = [generate_phone_number() for _ in range(100)]

print(json.dumps(phone_numbers))
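
Before committing generated samples, it may be worth checking that each one fully matches the target pattern. The sketch below assumes the pattern is `\d{3}-\d{3}-\d{4}`, inferred from the XXX-XXX-XXXX comment in the script; the pattern actually used by the benchmark may differ. The helper `invalid_samples` is hypothetical.

```python
import random
import re

# Assumed pattern, derived from the XXX-XXX-XXXX comment in the script above
PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")


def generate_phone_number():
    area_code = f"{random.randint(100, 999)}"
    prefix = f"{random.randint(100, 999)}"
    line_number = f"{random.randint(1000, 9999)}"
    return f"{area_code}-{prefix}-{line_number}"


def invalid_samples(samples, pattern):
    """Return the samples that do not match the pattern in full."""
    return [s for s in samples if pattern.fullmatch(s) is None]


phone_numbers = [generate_phone_number() for _ in range(100)]
bad = invalid_samples(phone_numbers, PHONE_PATTERN)
print(f"{len(bad)} invalid samples out of {len(phone_numbers)}")
```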

url.json

import pandas as pd
import json

url = 'https://raw.githubusercontent.com/steciuk/SNA-reddit-bipartite-analysis/2fc2b2920ab1ff173ae457b4b1fcd490eb1aee16/data/posts_technews.csv'
df = pd.read_csv(url)

url_column_list = df['url'].tolist()

print(json.dumps(url_column_list[:100]))

gsm8k.json

from datasets import load_dataset
import json

dataset = load_dataset("thesven/gsm8k-reasoning", split="train")
dataset = dataset.map(lambda row: {"answer": row["answer"].split("<<")[0].split("=")[0].strip()})

gsm8k_thinking = dataset.select(range(100))["answer"]

print(json.dumps([gt + ". The answer is 42." for gt in gsm8k_thinking]))
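
To make the string handling in the `map` call concrete, here is a worked example on a single answer. The input string is an illustrative GSM8K-style answer; rows in thesven/gsm8k-reasoning may be formatted differently, and `truncate_answer` is a name introduced only for this sketch.

```python
def truncate_answer(answer):
    # Same chain as the map() call above:
    # cut at the first "<<", then at the first "=", then trim whitespace
    return answer.split("<<")[0].split("=")[0].strip()


# Illustrative input, not taken from the dataset used in the script
row_answer = "Natalia sold 48/2 = <<48/2=24>>24 clips in May."

sample = truncate_answer(row_answer) + ". The answer is 42."
print(sample)  # Natalia sold 48/2. The answer is 42.
```

The effect is to keep only the opening of the reasoning text and append a fixed closing sentence, so every sample ends the same way.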

complex_str.json

import random
import json


def random_string_from_pattern():
    # Define the patterns to choose from
    patterns = [
        r'(0|[1-9][0-9]*)',  # Integer pattern
        r'true',             # True boolean
        r'false',            # False boolean
        r'([a-zA-Z_][a-zA-Z_0-9]*)'  # Identifier pattern (letters, digits, underscore)
    ]

    # Randomly select one pattern
    selected_pattern = random.choice(patterns)

    # If it's the integer pattern, generate a random integer
    if selected_pattern == r'(0|[1-9][0-9]*)':
        return str(random.choice([0] + [random.randint(1, 100)]))

    # If it's the identifier pattern, generate a random identifier
    elif selected_pattern == r'([a-zA-Z_][a-zA-Z_0-9]*)':
        identifier_length = random.randint(1, 10)
        identifier = ''.join(random.choices('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_', k=1))  # First character
        identifier += ''.join(random.choices('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789', k=identifier_length - 1))
        return identifier

    # If it's true or false, just return the string 'true' or 'false'
    else:
        return selected_pattern


def generate_random_string(n):
    return ''.join(random_string_from_pattern() for _ in range(n))


data = [generate_random_string(random.randint(1, 10)) for _ in range(100)]
print(json.dumps(data))
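
One detail of the integer branch is easy to miss: `random.choice([0] + [random.randint(1, 100)])` picks from a two-element list, so it returns 0 about half the time. The sketch below measures that share; `integer_branch` is a name introduced for this example, and a local `random.Random` is used so the measurement is repeatable.

```python
import random
from collections import Counter


def integer_branch(rng):
    # Same expression as the integer branch above: a two-element list,
    # one element of which is always 0
    return str(rng.choice([0] + [rng.randint(1, 100)]))


rng = random.Random(0)
draws = 10_000
counts = Counter(integer_branch(rng) for _ in range(draws))
zero_share = counts["0"] / draws
print(f"share of '0': {zero_share:.2f}")
```

If a flatter distribution were wanted, `str(rng.randint(0, 100))` would be one option. Whether the skew matters for the benchmark is not stated in the PR.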

long_integer.json

import random
import json


def random_long_number():
    first_digit = random.choice(range(1, 10))

    remaining_digits_length = random.randint(1, 14)
    remaining_digits = ''.join(random.choices('0123456789', k=remaining_digits_length))

    return f"+{first_digit}{remaining_digits}"


data = [random_long_number() for _ in range(100)]
print(json.dumps(data))
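
Because the script uses the global random state, rerunning it produces a different file each time. If reproducible regeneration is wanted, one option is a seeded local generator, sketched below. The `generate` helper and the seed value are assumptions for this example and are not part of the PR.

```python
import json
import random


def random_long_number(rng):
    first_digit = rng.choice(range(1, 10))
    remaining_digits = "".join(rng.choices("0123456789", k=rng.randint(1, 14)))
    return f"+{first_digit}{remaining_digits}"


def generate(seed, n=100):
    # A local generator leaves the global random state untouched
    rng = random.Random(seed)
    return [random_long_number(rng) for _ in range(n)]


data = generate(seed=0)
print(json.dumps(data[:3]))
```

Each value is a `+` followed by 2 to 15 digits with a nonzero leading digit, the same shape as in the original script.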

recording_schema.json and rpg_characters.json

import outlines
import json


JSON_SCHEMA = None  # TODO: Put schema here


qwen_model = outlines.models.transformers("Qwen/Qwen2.5-14B-Instruct", model_kwargs=dict(load_in_8bit=True))


def create_input(prompt):
    return qwen_model.tokenizer.tokenizer.apply_chat_template(
        [
            {"role": "system", "content": "You are a helpful AI assistant. You only speak English."},
            {"role": "user", "content": prompt}
        ],
        tokenize=False,
        add_generation_prompt=True,
    )


generator = outlines.generate.json(qwen_model, json.dumps(JSON_SCHEMA))


results = []
for batch_idx in range(25):
    # 25 batches of 4 identical prompts yield 100 samples in total
    inputs = [
        create_input(f"For the schema\n\n{JSON_SCHEMA}\n\nThis is a valid json:\n")
        for _ in range(4)
    ]
    # Retry the batch until generation succeeds
    while True:
        try:
            results += generator(inputs, max_tokens=1000)
            break
        except Exception as e:
            print(batch_idx, e)

print(json.dumps(results))
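
The retry loop in this script never gives up, so a deterministic failure would keep it spinning indefinitely. A minimal sketch of a bounded alternative follows. `generate_with_retries` is a hypothetical helper, and `flaky_generate` is a stand-in for the outlines generator so the example runs without a model.

```python
def generate_with_retries(generate, inputs, max_attempts=3):
    """Call generate(inputs), retrying on failure up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return generate(inputs)
        except Exception as e:
            last_error = e
            print(f"attempt {attempt} failed: {e}")
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


# Hypothetical stand-in for the generator: fails once, then succeeds
calls = {"n": 0}


def flaky_generate(inputs):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ValueError("simulated failure")
    return [f"result for {item}" for item in inputs]


batch = generate_with_retries(flaky_generate, ["a", "b"])
print(batch)
```

In the script above, the call site would pass something like `lambda x: generator(x, max_tokens=1000)` as the `generate` argument.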

lapp0 force-pushed the add-100-samples branch 12 times, most recently from 91c66eb to 80b549e, on October 13, 2024 23:44

rlouf commented Oct 15, 2024

I'm running the benchmarks locally. Would it be possible to upload the results folder as an Action artifact, so we can inspect the results whenever there is unexplained behavior?

lapp0 force-pushed the add-100-samples branch 4 times, most recently from cc7400b to c86e55d, on October 21, 2024 04:58
lapp0 merged commit eaae9e4 into main on Oct 21, 2024
0 of 2 checks passed