
Update vllm start, stop and use random port for vllm #96

Merged: 2 commits into opendatahub-io:main on Oct 18, 2024

Conversation

@sallyom (Collaborator) commented Oct 16, 2024

See Issue #95

This PR:

  • cleans up the launch and stop vLLM definitions and updates them to find an open port to serve on (see the port-selection sketch right after this list). The launch/stop vLLM definitions are still duplicated in two files; this is currently necessary because of the standalone script.
  • calls ilab's wait_for_stable_vram, following how ilab shuts down vLLM between evals
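
As a rough sketch of the port-selection idea in the first bullet (the helper name and details below are assumptions, not the exact code in this PR), a common approach in Python is to bind a socket to port 0 so the OS assigns an unused port:

import socket


def find_free_port() -> int:
    """Sketch only: bind to port 0 so the OS picks an unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


# Example usage: the chosen port would then be passed to the vLLM launch command.
# Note there is a small race window between closing this socket and vLLM binding the port.
port = find_free_port()
print(f"launching vLLM on port {port}")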

@sallyom changed the title from "update stop vllm definition to import ilab shutdown_process" to "[wip] update stop vllm definition to import ilab shutdown_process" on Oct 16, 2024
@leseb (Collaborator) left a comment:


Let us know when this is not wip anymore and open for review :), thanks!

@sallyom force-pushed the fix-bound-port-vllm branch 5 times, most recently from fbd9f5f to 459a22d on October 17, 2024 at 05:07
@sallyom changed the title from "[wip] update stop vllm definition to import ilab shutdown_process" to "Update vllm start, stop and use random port for vllm" on Oct 17, 2024
@sallyom force-pushed the fix-bound-port-vllm branch from 459a22d to f3c3329 on October 17, 2024 at 05:20
@sallyom (Collaborator, Author) commented Oct 17, 2024

@leseb please review. I am testing with the pipeline; I haven't run the standalone script yet. I'm also waiting on the output of this full run; if it fails, I will fix it tomorrow!

@leseb previously approved these changes on Oct 17, 2024
@leseb (Collaborator) commented Oct 17, 2024

> @leseb please review. I am testing with the pipeline; I haven't run the standalone script yet. I'm also waiting on the output of this full run; if it fails, I will fix it tomorrow!

Feel free to merge when you're done testing, but the code LGTM. Thanks!

@sallyom force-pushed the fix-bound-port-vllm branch from f3c3329 to c8a4c45 on October 17, 2024 at 12:21
# Rename the best model directory to "candidate_model" for the next step
# So we know which model to use for the final evaluation
os.rename(
    os.path.join(models_path_prefix, best_model),
    os.path.join(models_path_prefix, "candidate_model"),
)
best_model = f"{models_path_prefix}/candidate_model"
Collaborator:

please use os.path.join instead of string formatting.
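
To illustrate the suggestion (a sketch, not the exact diff; the models_path_prefix value is only an example borrowed from the job logs later in this thread):

import os

models_path_prefix = "/data/model/output/phase_2/hf_format"  # example value only

# current diff: string formatting
best_model = f"{models_path_prefix}/candidate_model"

# suggested: os.path.join
best_model = os.path.join(models_path_prefix, "candidate_model")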

Collaborator:

Also, why don't we print the "samples_" name instead? If we print "candidate_model", the user has no idea which model was the best, I guess...

Collaborator (Author):

I thought about this, but we rename it to candidate_model, so samples_ no longer exists, right? What's saved to S3 is "candidate_model", right?

Collaborator (Author):

I updated the output a bit, ptal

@leseb (Collaborator) commented Oct 17, 2024

test run here

pipeline.py (review thread resolved)
@MichaelClifford (Collaborator) left a comment:

LGTM

@@ -193,13 +193,16 @@ def shutdown_vllm(process: subprocess.Popen, timeout: int = 20):
    os.path.join(models_path_prefix, best_model),
    os.path.join(models_path_prefix, "candidate_model"),
)
best_model = f"{models_path_prefix}/candidate_model"
best_model_renamed = os.path.join(models_path_prefix, "candidate_model")
best_model_output = f"Candidate model: {best_model} located at {best_model_renamed}"
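
The hunk header above shows the shutdown_vllm signature but not its body. A minimal sketch of what such a shutdown helper typically does, assuming the usual terminate/wait/kill pattern (this is an assumption, not the actual implementation from this PR or from ilab):

import subprocess


def shutdown_vllm(process: subprocess.Popen, timeout: int = 20) -> None:
    """Sketch only: ask the vLLM server process to exit, escalating if needed."""
    if process.poll() is not None:
        return  # process already exited
    process.terminate()  # send SIGTERM
    try:
        process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()  # escalate to SIGKILL if it does not exit in time
        process.wait()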
Collaborator:

Why does it have to be a message? Later on we use this to print the best model.

Collaborator:

Job logs:

INFO 2024-10-18 11:29:26,271 __main__:2433: Job completed successfully.
INFO 2024-10-18 11:29:26,792 __main__:956: Best model: /data/model/output/phase_2/hf_format/samples_320
INFO 2024-10-18 11:29:26,792 __main__:3140: Running final evaluation.

Pod log:

{
    "best_model": "/data/model/output/phase_2/hf_format/samples_320",
    "best_score": 8.5
}

Later on for final eval, we know the model has been renamed to candidate_model so everything is fine.

Collaborator:

Because the script uses the Pod logs to print the best model, the content of best_model should not be a sentence.
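
As a hedged illustration of that point (the parsing code below is hypothetical, not the actual script): if a downstream step loads the pod-log JSON shown above, best_model has to be a bare path it can reuse, so wrapping it in a sentence would break the lookup.

import json

# Hypothetical reader of the pod-log JSON from the comment above.
pod_log = '{"best_model": "/data/model/output/phase_2/hf_format/samples_320", "best_score": 8.5}'
outputs = json.loads(pod_log)
best_model = outputs["best_model"]  # must stay a usable path, not a sentence
print(best_model)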

@sallyom (Collaborator, Author) commented Oct 18, 2024:

True, but the pipeline uses the output to find the model for final eval. For now, I've hard-coded that value as ../hf_format/candidate_model in the pipeline and removed the updated output here.

@sallyom force-pushed the fix-bound-port-vllm branch 2 times, most recently from 43f83c0 to c02bf63 on October 17, 2024 at 16:39
@leseb dismissed their stale review on October 18, 2024 at 12:30 with the reason:

3rd commit should not be included or reworked

@sallyom force-pushed the fix-bound-port-vllm branch from c02bf63 to 4697fb1 on October 18, 2024 at 13:34
@MichaelClifford merged commit 2b43a33 into opendatahub-io:main on Oct 18, 2024
1 check passed