
allow generate_data logger parameter to overwrite locally defined loggers #449

Closed

Conversation

khaledsulayman
Member

Signed-off-by: Khaled Sulayman [email protected]

@mergify mergify bot added the ci-failure label Dec 11, 2024
@bbrowning
Contributor

I'm a bit confused - what problem is this solving?

@cdoern
Contributor

cdoern commented Dec 11, 2024

@bbrowning this change allows the logger argument passed to generate_data to take effect beyond generate_data itself, in the modules it calls. Without it, invoking generate_data with a custom logger only captures the lines emitted directly by generate_data:

cat /Users/charliedoern/.local/share/instructlab/logs/generation/generation-2a9b8c55-8d30-4b8a-a661-6db7314ca8bf.log
2024-12-11 11:01:43,027 - instructlab.data.generate_data - INFO - Generating synthetic data using 'full' pipeline, '/Users/charliedoern/.cache/instructlab/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf' model, '/Users/charliedoern/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:50824/v1 server
2024-12-11 11:01:43,421 - instructlab.data.generate_data - INFO - Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
gen_spellcheck Prompt Generation:   8%|         | 10/120 [00:17<04:52,  2.66s/it]%

With this change it now looks like:

WARNING 2024-12-11 13:05:45,468 instructlab.data.generate_data:161: Disabling SDG batching - unsupported with llama.cpp serving
INFO 2024-12-11 13:05:45,748 numexpr.utils:162: NumExpr defaulting to 16 threads.
INFO 2024-12-11 13:05:45,911 datasets:59: PyTorch version 2.4.1 available.
2024-12-11 13:05:46,545 - instructlab.data.generate_data - INFO - Generating synthetic data using 'full' pipeline, '/Users/charliedoern/.cache/instructlab/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf' model, '/Users/charliedoern/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:51125/v1 server
INFO 2024-12-11 13:05:46,545 instructlab.data.generate_data:185: Generating synthetic data using 'full' pipeline, '/Users/charliedoern/.cache/instructlab/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf' model, '/Users/charliedoern/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:51125/v1 server
INFO 2024-12-11 13:05:46,902 instructlab.sdg.utils.taxonomy:148: Processing files...
INFO 2024-12-11 13:05:46,903 instructlab.sdg.utils.taxonomy:154: Pattern 'radio/kh1/kh1.md' matched 1 files.
INFO 2024-12-11 13:05:46,903 instructlab.sdg.utils.taxonomy:158: Processing file: /Users/charliedoern/.local/share/instructlab/datasets/documents-2024-12-11T13_05_46/knowledge_technology_radios_pw4ukaby/radio/kh1/kh1.md
INFO 2024-12-11 13:05:46,903 instructlab.sdg.utils.taxonomy:166: Appended Markdown content from /Users/charliedoern/.local/share/instructlab/datasets/documents-2024-12-11T13_05_46/knowledge_technology_radios_pw4ukaby/radio/kh1/kh1.md
2024-12-11 13:05:46,949 - instructlab.data.generate_data - INFO - Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-12-11 13:05:46,949 instructlab.data.generate_data:392: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-12-11 13:05:47,073 instructlab.sdg.checkpointing:59: No existing checkpoints found in /Users/charliedoern/.local/share/instructlab/datasets/checkpoints/knowledge_technology_radios, generating from scratch
INFO 2024-12-11 13:05:47,073 instructlab.sdg.pipeline:159: Running pipeline single-threaded
INFO 2024-12-11 13:05:47,073 instructlab.sdg.pipeline:203: Running block: duplicate_document_col
INFO 2024-12-11 13:05:47,078 instructlab.sdg.blocks.llmblock:55: LLM server supports batched inputs: False
INFO 2024-12-11 13:05:47,078 instructlab.sdg.pipeline:203: Running block: gen_spellcheck
gen_spellcheck Prompt Generation:   0%|          | 0/120 [00:00<?, ?it/s]/Users/charliedoern/Documents/instructlab/venv/lib/python3.11/site-packages/llama_cpp/llama.py:1054: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...
  warnings.warn(
gen_spellcheck Prompt Generation:   6%|         | 7/120 [00:08<03:26,  1.83s/it]^C^C
Aborted!
gen_spellcheck Prompt Generation:   6%|         | 7/120 [00:10<02:55,  1.55s/it]

Contributor

@cdoern cdoern left a comment


I feel like this might warrant a proper logging package inside of SDG to manage settings passed in from external loggers, but this resolves the issue. Maybe add a few comments to explain why we do the whole LOGGER thing?

@mergify mergify bot added the one-approval label Dec 11, 2024
@cdoern
Contributor

cdoern commented Dec 11, 2024

oh I like this version even more, so you only need to override the logger in generate_data, and everywhere else you just remove __name__ and use the root logger?
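The root-logger variant described here could look roughly as follows: modules keep named loggers (which propagate to the root by default), and generate_data alone wires the caller's configuration onto the root logger. The helper name `apply_logger_config` is hypothetical, used only to sketch the idea:

```python
import logging


def apply_logger_config(logger: logging.Logger) -> None:
    """Hypothetical helper: copy a caller-supplied logger's level and
    handlers onto the root logger, so every named logger in the package
    inherits them through normal propagation."""
    root = logging.getLogger()
    root.setLevel(logger.level)
    for handler in logger.handlers:
        root.addHandler(handler)
```

Under this scheme no module needs to rebind its own logger; any `logging.getLogger("some.module")` call picks up the caller's handlers automatically via propagation to the root.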

@mergify mergify bot added the ci-failure label Dec 11, 2024
@cdoern cdoern added the hold In-progress PR. Tag should be removed before merge. label Dec 11, 2024
@khaledsulayman
Member Author

we found the issue that was preventing the proposed logging configs from propagating down, so this change is not needed

Labels
ci-failure hold In-progress PR. Tag should be removed before merge. one-approval

3 participants