
Update vllm start, stop and use random port for vllm #96

Merged: 2 commits into opendatahub-io:main on Oct 18, 2024

Conversation

@sallyom (Collaborator) commented Oct 16, 2024

See Issue #95

This PR:

  • cleans up the launch and stop vLLM definitions and updates them to find an open port to serve on (see the port-selection sketch right after this list). The launch/stop vLLM definitions are still duplicated in two files; this is currently necessary because of the standalone script.
  • calls ilab's wait_for_stable_vram, following how ilab shuts down vLLM between evals
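
As a rough sketch of the port-selection idea in the first bullet (the helper name and details below are assumptions, not the exact code in this PR), a common approach in Python is to bind a socket to port 0 so the OS assigns an unused port:

import socket


def find_free_port() -> int:
    """Sketch only: bind to port 0 so the OS picks an unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


# Example usage: the chosen port would then be passed to the vLLM launch command.
# Note there is a small race window between closing this socket and vLLM binding the port.
port = find_free_port()
print(f"launching vLLM on port {port}")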

@sallyom changed the title from "update stop vllm definition to import ilab shutdown_process" to "[wip] update stop vllm definition to import ilab shutdown_process" on Oct 16, 2024
@leseb (Collaborator) left a comment:


Let us know when this is not wip anymore and open for review :), thanks!

@sallyom force-pushed the fix-bound-port-vllm branch 5 times, most recently from fbd9f5f to 459a22d on October 17, 2024 at 05:07
@sallyom changed the title from "[wip] update stop vllm definition to import ilab shutdown_process" to "Update vllm start, stop and use random port for vllm" on Oct 17, 2024
@sallyom force-pushed the fix-bound-port-vllm branch from 459a22d to f3c3329 on October 17, 2024 at 05:20
@sallyom (Collaborator, Author) commented Oct 17, 2024

@leseb please review. I am testing with the pipeline; I haven't run the standalone script yet. I'm also waiting on the output of this full run; if it fails, I will fix it tomorrow!

@leseb previously approved these changes on Oct 17, 2024
@leseb (Collaborator) commented Oct 17, 2024

> @leseb please review. I am testing with the pipeline; I haven't run the standalone script yet. I'm also waiting on the output of this full run; if it fails, I will fix it tomorrow!

Feel free to merge when you're done testing, but the code LGTM. Thanks!

@sallyom force-pushed the fix-bound-port-vllm branch from f3c3329 to c8a4c45 on October 17, 2024 at 12:21
# Rename the best model directory to "candidate_model" for the next step
# So we know which model to use for the final evaluation
os.rename(
    os.path.join(models_path_prefix, best_model),
    os.path.join(models_path_prefix, "candidate_model"),
)
best_model = f"{models_path_prefix}/candidate_model"
Collaborator:

please use os.path.join instead of string formatting.
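
To illustrate the suggestion (a sketch, not the exact diff; the models_path_prefix value is only an example borrowed from the job logs later in this thread):

import os

models_path_prefix = "/data/model/output/phase_2/hf_format"  # example value only

# current diff: string formatting
best_model = f"{models_path_prefix}/candidate_model"

# suggested: os.path.join
best_model = os.path.join(models_path_prefix, "candidate_model")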

Collaborator:

Also, why don't we print the "samples_" name instead? If we print "candidate_model", the user has no idea which model was the best, I guess...

Collaborator (Author):

I thought about this, but we rename it to candidate_model, so samples_ no longer exists, right? What's saved to S3 is "candidate_model", right?

Collaborator (Author):

I updated the output a bit, ptal

@leseb (Collaborator) commented Oct 17, 2024

test run here

pipeline.py (review thread resolved)
@MichaelClifford (Collaborator) left a comment:

LGTM

@@ -193,13 +193,16 @@ def shutdown_vllm(process: subprocess.Popen, timeout: int = 20):
    os.path.join(models_path_prefix, best_model),
    os.path.join(models_path_prefix, "candidate_model"),
)
best_model = f"{models_path_prefix}/candidate_model"
best_model_renamed = os.path.join(models_path_prefix, "candidate_model")
best_model_output = f"Candidate model: {best_model} located at {best_model_renamed}"
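
The hunk header above shows the shutdown_vllm signature but not its body. A minimal sketch of what such a shutdown helper typically does, assuming the usual terminate/wait/kill pattern (this is an assumption, not the actual implementation from this PR or from ilab):

import subprocess


def shutdown_vllm(process: subprocess.Popen, timeout: int = 20) -> None:
    """Sketch only: ask the vLLM server process to exit, escalating if needed."""
    if process.poll() is not None:
        return  # process already exited
    process.terminate()  # send SIGTERM
    try:
        process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()  # escalate to SIGKILL if it does not exit in time
        process.wait()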
Collaborator:

Why does it have to be a message? Later on we use this to print the best model.

Collaborator:

Job logs:

INFO 2024-10-18 11:29:26,271 __main__:2433: Job completed successfully.
INFO 2024-10-18 11:29:26,792 __main__:956: Best model: /data/model/output/phase_2/hf_format/samples_320
INFO 2024-10-18 11:29:26,792 __main__:3140: Running final evaluation.

Pod log:

{
    "best_model": "/data/model/output/phase_2/hf_format/samples_320",
    "best_score": 8.5
}

Later on for final eval, we know the model has been renamed to candidate_model so everything is fine.

Collaborator:

Because the script uses the Pod logs to print the best model, the content of best_model should not be a sentence.
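
As a hedged illustration of that point (the parsing code below is hypothetical, not the actual script): if a downstream step loads the pod-log JSON shown above, best_model has to be a bare path it can reuse, so wrapping it in a sentence would break the lookup.

import json

# Hypothetical reader of the pod-log JSON from the comment above.
pod_log = '{"best_model": "/data/model/output/phase_2/hf_format/samples_320", "best_score": 8.5}'
outputs = json.loads(pod_log)
best_model = outputs["best_model"]  # must stay a usable path, not a sentence
print(best_model)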

@sallyom (Collaborator, Author) commented Oct 18, 2024:

True, but the pipeline uses the output to find the model for final eval. For now, I've hard-coded that value as ../hf_format/candidate_model in the pipeline and removed the updated output here.

@sallyom force-pushed the fix-bound-port-vllm branch 2 times, most recently from 43f83c0 to c02bf63 on October 17, 2024 at 16:39
@leseb dismissed their stale review on October 18, 2024 at 12:30 with the reason:

3rd commit should not be included or reworked

@sallyom force-pushed the fix-bound-port-vllm branch from c02bf63 to 4697fb1 on October 18, 2024 at 13:34
@MichaelClifford merged commit 2b43a33 into opendatahub-io:main on Oct 18, 2024
1 check passed