
Add 100 Samples Per Regex / JSON Schema #35

Merged: 3 commits merged into dottxt-ai:main on Oct 21, 2024

Conversation

@lapp0 (Contributor) commented Oct 11, 2024

Fixes #19

Changes

  • Adds 100 samples for each schema / pattern to src/samples/
  • data.py: Remove example key and replace with samples key
  • Update all src/benchmark_*.py ASV benchmark scripts to run 100 samples per benchmark

Caveat: We need to use RegexGuide.from_regex once dottxt-ai/outlines#1204 is merged and the outlines version is bumped.
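As a rough illustration of the samples / data.py layout described above (the helper name, file paths, and the phone-number pattern below are placeholders, not the repo's exact code):

import json

def load_samples(name: str) -> list[str]:
    # Each file under src/samples/ is a JSON list of 100 strings,
    # e.g. src/samples/phone_number.json produced by the script further below.
    with open(f"src/samples/{name}.json") as f:
        return json.load(f)

# Hypothetical shape of a data.py entry after replacing the `example` key with `samples`:
regex_cases = {
    "Phone Number": {
        "regex": r"\d{3}-\d{3}-\d{4}",  # placeholder pattern, for illustration only
        "samples": load_samples("phone_number"),
    },
}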

Sample Generation Scripts

phone_number.json

import random
import json

def generate_phone_number():
    # Generate 3 random digits, 3 random digits, and 4 random digits for the phone number
    area_code = f'{random.randint(100, 999)}'
    prefix = f'{random.randint(100, 999)}'
    line_number = f'{random.randint(1000, 9999)}'

    # Combine the parts into the format XXX-XXX-XXXX
    return f'{area_code}-{prefix}-{line_number}'

# Create a list of 100 phone numbers
phone_numbers = [generate_phone_number() for _ in range(100)]

print(json.dumps(phone_numbers))

url.json

import pandas as pd
import json

url = 'https://raw.githubusercontent.com/steciuk/SNA-reddit-bipartite-analysis/2fc2b2920ab1ff173ae457b4b1fcd490eb1aee16/data/posts_technews.csv'
df = pd.read_csv(url)

url_column_list = df['url'].tolist()

print(json.dumps(url_column_list[:100]))

gsm8k.json

from datasets import load_dataset
import json

dataset = load_dataset("thesven/gsm8k-reasoning", split="train")
dataset = dataset.map(lambda row: {"answer": row["answer"].split("<<")[0].split("=")[0].strip()})

gsm8k_thinking = dataset.select(range(100))["answer"]

print(json.dumps([gt + ". The answer is 42." for gt in gsm8k_thinking]))

complex_str.json

import random
import json


def random_string_from_pattern():
    # Define the patterns to choose from
    patterns = [
        r'(0|[1-9][0-9]*)',  # Integer pattern
        r'true',             # True boolean
        r'false',            # False boolean
        r'([a-zA-Z_][a-zA-Z_0-9]*)'  # Identifier pattern (letters, digits, underscore)
    ]

    # Randomly select one pattern
    selected_pattern = random.choice(patterns)

    # If it's the integer pattern, generate a random integer
    if selected_pattern == r'(0|[1-9][0-9]*)':
        return str(random.choice([0] + [random.randint(1, 100)]))

    # If it's the identifier pattern, generate a random identifier
    elif selected_pattern == r'([a-zA-Z_][a-zA-Z_0-9]*)':
        identifier_length = random.randint(1, 10)
        identifier = ''.join(random.choices('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_', k=1))  # First character
        identifier += ''.join(random.choices('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789', k=identifier_length - 1))
        return identifier

    # If it's true or false, just return the string 'true' or 'false'
    else:
        return selected_pattern


def generate_random_string(n):
    return ''.join(random_string_from_pattern() for _ in range(n))


data = [generate_random_string(random.randint(1, 10)) for _ in range(100)]
print(json.dumps(data))

long_integer.json

import random
import json


def random_long_number():
    first_digit = random.choice(range(1, 10))

    remaining_digits_length = random.randint(1, 14)
    remaining_digits = ''.join(random.choices('0123456789', k=remaining_digits_length))

    return f"+{first_digit}{remaining_digits}"


data = [random_long_number() for _ in range(100)]
print(json.dumps(data))

recording_schema.json and rpg_characters.json

import outlines
import json


JSON_SCHEMA = None  # TODO: Put schema here


qwen_model = outlines.models.transformers("Qwen/Qwen2.5-14B-Instruct", model_kwargs=dict(load_in_8bit=True))


def create_input(prompt):
    return qwen_model.tokenizer.tokenizer.apply_chat_template(
        [
            {"role": "system", "content": "You are a helpful AI assistant. You only speak English."},
            {"role": "user", "content": prompt}
        ],
        tokenize=False,
        add_generation_prompt=True,
    )


generator = outlines.generate.json(qwen_model, json.dumps(JSON_SCHEMA))


results = []
for _ in range(25):
    inputs = [
        create_input(f"For the schema\n\n{JSON_SCHEMA}\n\nThis is a valid json:\n")
        for _ in range(4)
    ]
    while True:
        try:
            results += generator(inputs, max_tokens=1000)
            break
        except Exception as e:
            print(_, e)

print(json.dumps(results))

TODO

  • Figure out why outlines-core is faster than outlines on regex but slower on JSON.
  • Bump to outlines-core's latest release
  • Separate "compilation" time (i.e. TTFT, which is incurred on every run) from the number of tokens per second after compilation.

@lapp0 force-pushed the add-100-samples branch 3 times, most recently from a199b7c to 698809b on October 11, 2024 17:10
@rlouf marked this pull request as ready for review on October 11, 2024 17:19
@lapp0 force-pushed the add-100-samples branch 8 times, most recently from 91c66eb to 80b549e on October 13, 2024 23:44
for i in range(len(regex_example_tokens)):
    _ = token_enforcer.get_allowed_tokens(regex_example_tokens[: i + 1])
for regex_sample in regex_samples:
    regex_sample_tokens = self.tokenizer.encode(regex_sample)
Member review comment:

Let's get this out of the timing method by pre-tokenizing the samples so we don't time this.
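A minimal sketch of that suggestion (the function split below is illustrative of what would live in setup() versus the timed method, not the repo's exact code):

def pretokenize_samples(tokenizer, regex_samples):
    # Done once in the benchmark's setup(), outside any timed method,
    # so tokenization cost is not included in the measurement.
    return [tokenizer.encode(sample) for sample in regex_samples]

def time_lfe(token_enforcer, regex_samples_tokens):
    # Timed method: only the allowed-token computation is measured.
    for sample_tokens in regex_samples_tokens:
        for i in range(len(sample_tokens)):
            _ = token_enforcer.get_allowed_tokens(sample_tokens[: i + 1])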

@rlouf (Member) commented Oct 15, 2024

Given that the timings for OutlinesJSONSchema are in the tens of milliseconds, my suspicion is that the port of build_regex_from_schema to Rust in outlines-core is inefficient for some reason. Could you profile the run for JSON Schema and outlines-core to confirm this? Actually, the first thing to try is to compare the regexes that were generated by outlines with those currently generated by outlines-core.

Note that timings for this function on outlines-core are in the tens of microseconds. This is a mystery to me.
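A quick way to do that first comparison might look like the following (a sketch; the build_regex_from_schema import paths are assumptions about the outlines / outlines-core versions discussed here):

import json

# Assumed import paths; adjust if these modules have moved between versions.
from outlines.fsm.json_schema import build_regex_from_schema as build_regex_outlines
from outlines_core.fsm.json_schema import build_regex_from_schema as build_regex_core

schema = json.dumps({"type": "object", "properties": {"name": {"type": "string"}}})

regex_outlines = build_regex_outlines(schema)
regex_core = build_regex_core(schema)

print("identical:", regex_outlines == regex_core)
print("outlines     :", regex_outlines)
print("outlines-core:", regex_core)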

src/benchmark_lfe.py (outdated review thread, resolved)
@rlouf (Member) commented Oct 15, 2024

Ran the benchmarks locally with outlines-core==0.1.14, and the difference between Outlines and outlines-core is still mysterious (outlines-core is faster on regex but slower on JSON):

[58.33%] ··· benchmark_lfe.LMFormatEnforcerJsonSchema.time_lfe                                                                                                                                                                                                                         ok
[58.33%] ··· ===================================== =============== ======================
             --                                               json_schema_name           
             ------------------------------------- --------------------------------------
                             model                  RPG character   Simple nested schema 
             ===================================== =============== ======================
              NousResearch/Nous-Hermes-llama-2-7b     40.0±0.4μs          199±5μs        
                              gpt2                    40.8±0.9μs          216±3μs        
               NousResearch/Hermes-3-Llama-3.1-8B      192±5μs            289±1μs        
                 unsloth/gemma-2-2b-it-bnb-4bit        210±9μs            259±10μs       
             ===================================== =============== ======================

[66.67%] ··· benchmark_lfe.LMFormatEnforcerRegex.time_lfe                                                                                                                                                                                                                              ok
[66.67%] ··· ===================================== ============== ============ =========== ================ ==============
             --                                                                   regex_name                              
             ------------------------------------- -----------------------------------------------------------------------
                             model                  Phone Number      URL         GSM8K     Complex string   Long integer 
             ===================================== ============== ============ =========== ================ ==============
              NousResearch/Nous-Hermes-llama-2-7b    41.2±0.2ms     537±2ms     130±0.4ms     80.5±0.1ms     28.2±0.05ms  
                              gpt2                   12.9±0.1ms     401±5ms     204±0.9ms      486±10ms       7.74±0.2ms  
               NousResearch/Hermes-3-Llama-3.1-8B    18.9±0.2ms    4.72±0.08s    252±1ms      1.13±0.03s      27.9±0.3ms  
                 unsloth/gemma-2-2b-it-bnb-4bit       47.3±2ms     11.8±0.05s    289±7ms      2.14±0.06s      40.9±0.1ms  
             ===================================== ============== ============ =========== ================ ==============

[75.00%] ··· benchmark_outlines.OutlinesJsonSchema.time_outlines                                                                                                                                                                                                                       ok
[75.00%] ··· ===================================== =============== ======================
             --                                               json_schema_name           
             ------------------------------------- --------------------------------------
                             model                  RPG character   Simple nested schema 
             ===================================== =============== ======================
              NousResearch/Nous-Hermes-llama-2-7b     13.8±0.1ms         12.9±0.2ms      
                              gpt2                    15.9±0.1ms         17.1±0.7ms      
               NousResearch/Hermes-3-Llama-3.1-8B      81.3±1ms          75.6±0.9ms      
                 unsloth/gemma-2-2b-it-bnb-4bit        197±10ms           197±6ms        
             ===================================== =============== ======================

[83.33%] ··· benchmark_outlines.OutlinesRegex.time_outlines                                                                                                                                                                                                                            ok
[83.33%] ··· ===================================== ============== ============ ============ ================ ==============
             --                                                                   regex_name                               
             ------------------------------------- ------------------------------------------------------------------------
                             model                  Phone Number      URL         GSM8K      Complex string   Long integer 
             ===================================== ============== ============ ============ ================ ==============
              NousResearch/Nous-Hermes-llama-2-7b    81.4±0.6ms     170±2ms     8.28±0.04s      85.3±1ms       90.1±0.8ms  
                              gpt2                   112±0.8ms      235±7ms     15.3±0.05s      114±1ms         129±1ms    
               NousResearch/Hermes-3-Llama-3.1-8B     381±3ms       651±6ms     30.7±0.3s       380±4ms         426±4ms    
                 unsloth/gemma-2-2b-it-bnb-4bit       859±10ms     1.36±0.01s   1.05±0.01m      840±2ms         918±4ms    
             ===================================== ============== ============ ============ ================ ==============

[91.67%] ··· benchmark_outlines_core.OutlinesCoreJsonSchema.time_outlines_core                                                                                                                                                                                                         ok
[91.67%] ··· ===================================== =============== ======================
             --                                               json_schema_name           
             ------------------------------------- --------------------------------------
                             model                  RPG character   Simple nested schema 
             ===================================== =============== ======================
              NousResearch/Nous-Hermes-llama-2-7b     285±0.7ms           602±1ms        
                              gpt2                     403±2ms            846±2ms        
               NousResearch/Hermes-3-Llama-3.1-8B      958±20ms          1.77±0.03s      
                 unsloth/gemma-2-2b-it-bnb-4bit       1.99±0.01s         3.44±0.01s      
             ===================================== =============== ======================

[100.00%] ··· benchmark_outlines_core.OutlinesCoreRegex.time_outlines_core                                                                                                                                                                                                              ok
[100.00%] ··· ===================================== ============== =========== ============ ================ ==============
              --                                                                   regex_name                              
              ------------------------------------- -----------------------------------------------------------------------
                              model                  Phone Number      URL        GSM8K      Complex string   Long integer 
              ===================================== ============== =========== ============ ================ ==============
               NousResearch/Nous-Hermes-llama-2-7b    81.5±0.2ms    143±0.3ms    5.49±0s       85.8±0.3ms      82.9±0.3ms  
                               gpt2                   100±0.3ms      189±3ms    10.7±0.01s      107±2ms         103±2ms    
                NousResearch/Hermes-3-Llama-3.1-8B     274±6ms       433±9ms    20.5±0.2s       286±7ms         284±7ms    
                  unsloth/gemma-2-2b-it-bnb-4bit       613±3ms       935±5ms    41.7±0.1s       647±6ms         634±5ms    
              ===================================== ============== =========== ============ ================ ==============

pyproject.toml (outdated review thread)

    "outlines==0.0.46",
    "outlines-core==0.1.0",
    "lm-format-enforcer==0.10.7",
    "outlines==0.1.1",
Member review comment:

The idea is to compare to the Numba version, can you use an earlier version?

@lapp0 (Contributor Author) replied Oct 16, 2024:

Since we're no longer maintaining the Numba implementation of regex.py, wouldn't it make sense to reference the last benchmark run prior to replacement rather than continuously tracking it?

Outlines benchmarks: https://github.com/dottxt-ai/outlines/actions/runs/11079055001/job/30787437777

I could also perform a single run of this suite with the Numba implementation without merging it if that makes sense.

Member reply:

Not for now, we need the numbers for the outlines-core release. We’ll tag main once we’re happy with the setup, refer people to this tag for comparisons with Outlines and eventually remove it. Does that make sense?

@lapp0 (Contributor Author) replied:

Sounds good.

outlines-core doesn't have caching. I assume you'd like me to use Outlines caching with outlines-core? (for now we can just copy https://github.com/dottxt-ai/outlines/blob/main/outlines/fsm/guide.py#L76-L99)
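For reference, a minimal caching wrapper along those lines might look like this (a sketch only: it assumes the outlines_core.fsm.guide.RegexGuide import path and the RegexGuide.from_regex constructor mentioned in the PR description's caveat, and caches per regex string for a fixed tokenizer, similar in spirit to the linked guide.py code):

from functools import lru_cache

from outlines_core.fsm.guide import RegexGuide  # assumed import path


def make_cached_guide_builder(tokenizer):
    # Cache compiled guides per regex string so repeated runs reuse the index.
    @lru_cache(maxsize=None)
    def get_guide(regex_string: str) -> RegexGuide:
        return RegexGuide.from_regex(regex_string, tokenizer)

    return get_guide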

Member reply:

Also let's use the latest version of outlines-core.

@lapp0 force-pushed the add-100-samples branch 3 times, most recently from 08f10af to 0d73c1c on October 16, 2024 09:21
@lapp0 (Contributor Author) commented Oct 16, 2024

  1. Updated to the latest version of all three benchmarked packages.

  2. Fixed absurdly low runtimes (see the sketch after this list):

  • Outlines: added a teardown() step to clear the cache
  • lm-format-enforcer: added a teardown() step to delete the TokenEnforcer and its contained cache
  • JsonSchema: ensured "samples" is a list, not a generator that is exhausted before the measured run starts

By default ASV runs warmup steps prior to the measured run, which resulted in the unexpected caching and generator exhaustion described above.

  3. Added an "Upload Benchmark Results Folder" step to asv_benchmarks_pr.yaml (@rlouf, should this be in asv_benchmark_main.yml as well?)

  4. Creating a separate PR to split up Time to First Token and Tokens Per Second.
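For illustration, the runtime fixes in item 2 could look roughly like this (a sketch: the class name is illustrative, and outlines.caching.clear_cache is assumed to be the cache-clearing entry point):

import json

import outlines.caching  # assumed to expose clear_cache()


class OutlinesRegex:
    def setup(self, model, regex_name):
        # Materialize the samples as a list so ASV's warmup pass cannot exhaust
        # a generator before the measured run starts.
        with open(f"src/samples/{regex_name}.json") as f:
            self.samples = list(json.load(f))

    def teardown(self, model, regex_name):
        # Clear outlines' cache after each run so warmup does not leave cached
        # indexes behind and make the measured run look absurdly fast.
        outlines.caching.clear_cache()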

> Given that the timings for OutlinesJSONSchema are in the tens of milliseconds, my suspicion is that the port of build_regex_from_schema to Rust in outlines-core is inefficient for some reason. Could you profile the run for JSON Schema and outlines-core to confirm this? Actually, the first thing to try is to compare the regexes that were generated by outlines with those currently generated by outlines-core.

Seeing more sane benchmarks locally for a small subset. Will analyze the results of the latest benchmark run first to ensure this is necessary.

@rlouf (Member) commented Oct 16, 2024

A few comments:

  • Can you downgrade outlines to a version that used Numba, and use the latest version of outlines-core? (I pushed to your branch)
  • On PRs we should use the --quick flag of asv (asv run --quick) but keep it as is when merging on main (I pushed to your branch)
  • We need to increase the timeout for lm-format-enforcer (I pushed to your branch)
  • Benchmarks are currently failing for outlines and outlines-core
  • Timings look much more reasonable

@brandonwillard (Member) left a comment:

The benchmark method names, i.e. time_{package}, seem a little redundant. The package is already given by the class name, and exactly what's being timed isn't apparent. Can we change one of those so that it clarifies exactly what is being measured?

@lapp0 (Contributor Author) commented Oct 17, 2024

Pushed a5adbe4 to fix benchmarks (sample run)

  • Note: time_lfe_total / time_lfe_runtime still fail due to a timeout for "Simple nested schema" with unsloth/gemma-2-2b-it-bnb-4bit

Changes

  • Introduces new benchmarks:
    • time_{package}_first_token (time to first token)
    • time_{package}_runtime (time to generate all samples after first token)
    • time_{package}_total (renamed time_{package}, sum of first_token and runtime)
  • Refactored the code to make it cleaner and more concise, applying DRY.
  • Ensure samples are tokenized in setup()
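Roughly, the intent of the first_token / runtime / total split is the following (a generic sketch with hypothetical helper names, not the benchmark code itself):

import time


def split_first_token_and_runtime(compile_index, next_mask, sample_token_ids):
    # "First token": index compilation plus the first allowed-token computation.
    start = time.perf_counter()
    index = compile_index()
    next_mask(index, sample_token_ids[:1])
    first_token = time.perf_counter() - start

    # "Runtime": allowed-token computations for every remaining prefix of the sample.
    start = time.perf_counter()
    for i in range(1, len(sample_token_ids)):
        next_mask(index, sample_token_ids[: i + 1])
    runtime = time.perf_counter() - start

    # "Total" is then approximately first_token + runtime.
    return first_token, runtime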

Benchmarks

For NousResearch/Nous-Hermes-llama-2-7b (Long integer, Simple nested schema):

| Parameter            | Method      | Benchmark Outlines Core | Benchmark Outlines | Benchmark LFE |
|----------------------|-------------|-------------------------|--------------------|---------------|
| Simple nested schema | first_token | 1.22s                   | 2.82s              | 457μs         |
| Simple nested schema | runtime     | 47.3ms                  | 18.5ms             | 8.40s         |
| Simple nested schema | total       | 1.28s                   | 2.77s              | 8.58s         |
| Long integer         | first_token | 178ms                   | 1.11s              | 930μs         |
| Long integer         | runtime     | 5.06ms                  | 1.96ms             | 38.1ms        |
| Long integer         | total       | 180ms                   | 1.11s              | 39.6ms        |

Edit

Pushed c86e55d, which fixes a bug that caused the total and first_token benchmarks to run twice.

@lapp0 force-pushed the add-100-samples branch 4 times, most recently from cc7400b to c86e55d on October 21, 2024 04:58
@lapp0 (Contributor Author) commented Oct 21, 2024

Just a heads-up: the main branch is currently pinned to outlines-core==0.1.0, which uses a different RegexGuide interface. This causes the PR benchmark tests to fail. However, after merging, the benchmarks run, with the caveat that they often time out.

You can see the benchmark workflow run for asv_benchmark_main.yml here: https://github.com/lapp0/benchmarks/actions/runs/11433488618/job/31810480964
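For context, the interface difference is roughly the following (a sketch; exact signatures may differ between outlines-core releases):

from outlines_core.fsm.guide import RegexGuide  # assumed import path


def build_guide(regex_string, tokenizer, pinned_to_0_1_0: bool):
    if pinned_to_0_1_0:
        # outlines-core==0.1.0, as currently pinned on main
        return RegexGuide(regex_string, tokenizer)
    # newer outlines-core releases targeted by this PR
    return RegexGuide.from_regex(regex_string, tokenizer)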

@rlouf (Member) commented Oct 21, 2024

Everything works as intended, so I will merge this PR. I will do a follow-up PR to separate the outlines and outlines-core benchmarking code: not only does it seem to introduce extra benchmarking steps, but we will also soon remove outlines from these benchmarks.

@rlouf merged commit 0e02ffb into dottxt-ai:main on Oct 21, 2024
1 of 2 checks passed
Successfully merging this pull request may close these issues:

Compare libraries when several sequences are generated