
allow generate_data logger parameter to overwrite locally defined loggers #449

Closed

Conversation

khaledsulayman
Member

Signed-off-by: Khaled Sulayman [email protected]

@mergify mergify bot added the ci-failure label Dec 11, 2024
@bbrowning
Contributor

I'm a bit confused - what problem is this solving?

@cdoern
Contributor

cdoern commented Dec 11, 2024

@bbrowning this change allows the logger argument passed to generate_data to take effect beyond generate_data itself, in the modules it calls. Without it, invoking generate_data with a custom logger only captures the lines emitted directly by generate_data:

cat /Users/charliedoern/.local/share/instructlab/logs/generation/generation-2a9b8c55-8d30-4b8a-a661-6db7314ca8bf.log
2024-12-11 11:01:43,027 - instructlab.data.generate_data - INFO - Generating synthetic data using 'full' pipeline, '/Users/charliedoern/.cache/instructlab/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf' model, '/Users/charliedoern/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:50824/v1 server
2024-12-11 11:01:43,421 - instructlab.data.generate_data - INFO - Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
gen_spellcheck Prompt Generation:   8%|         | 10/120 [00:17<04:52,  2.66s/it]%

With this change it now looks like:

WARNING 2024-12-11 13:05:45,468 instructlab.data.generate_data:161: Disabling SDG batching - unsupported with llama.cpp serving
INFO 2024-12-11 13:05:45,748 numexpr.utils:162: NumExpr defaulting to 16 threads.
INFO 2024-12-11 13:05:45,911 datasets:59: PyTorch version 2.4.1 available.
2024-12-11 13:05:46,545 - instructlab.data.generate_data - INFO - Generating synthetic data using 'full' pipeline, '/Users/charliedoern/.cache/instructlab/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf' model, '/Users/charliedoern/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:51125/v1 server
INFO 2024-12-11 13:05:46,545 instructlab.data.generate_data:185: Generating synthetic data using 'full' pipeline, '/Users/charliedoern/.cache/instructlab/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf' model, '/Users/charliedoern/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:51125/v1 server
INFO 2024-12-11 13:05:46,902 instructlab.sdg.utils.taxonomy:148: Processing files...
INFO 2024-12-11 13:05:46,903 instructlab.sdg.utils.taxonomy:154: Pattern 'radio/kh1/kh1.md' matched 1 files.
INFO 2024-12-11 13:05:46,903 instructlab.sdg.utils.taxonomy:158: Processing file: /Users/charliedoern/.local/share/instructlab/datasets/documents-2024-12-11T13_05_46/knowledge_technology_radios_pw4ukaby/radio/kh1/kh1.md
INFO 2024-12-11 13:05:46,903 instructlab.sdg.utils.taxonomy:166: Appended Markdown content from /Users/charliedoern/.local/share/instructlab/datasets/documents-2024-12-11T13_05_46/knowledge_technology_radios_pw4ukaby/radio/kh1/kh1.md
2024-12-11 13:05:46,949 - instructlab.data.generate_data - INFO - Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-12-11 13:05:46,949 instructlab.data.generate_data:392: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-12-11 13:05:47,073 instructlab.sdg.checkpointing:59: No existing checkpoints found in /Users/charliedoern/.local/share/instructlab/datasets/checkpoints/knowledge_technology_radios, generating from scratch
INFO 2024-12-11 13:05:47,073 instructlab.sdg.pipeline:159: Running pipeline single-threaded
INFO 2024-12-11 13:05:47,073 instructlab.sdg.pipeline:203: Running block: duplicate_document_col
INFO 2024-12-11 13:05:47,078 instructlab.sdg.blocks.llmblock:55: LLM server supports batched inputs: False
INFO 2024-12-11 13:05:47,078 instructlab.sdg.pipeline:203: Running block: gen_spellcheck
gen_spellcheck Prompt Generation:   0%|          | 0/120 [00:00<?, ?it/s]/Users/charliedoern/Documents/instructlab/venv/lib/python3.11/site-packages/llama_cpp/llama.py:1054: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...
  warnings.warn(
gen_spellcheck Prompt Generation:   6%|         | 7/120 [00:08<03:26,  1.83s/it]^C^C
Aborted!
gen_spellcheck Prompt Generation:   6%|         | 7/120 [00:10<02:55,  1.55s/it]

Contributor

@cdoern cdoern left a comment


I feel like this might warrant a proper logging package inside of SDG to manage settings passed in from external loggers, but this resolves the issue. Maybe add a few comments to explain why we do the whole LOGGER thing?

@mergify mergify bot added the one-approval label Dec 11, 2024
@cdoern
Contributor

cdoern commented Dec 11, 2024

oh I like this version even more, so you only need to override the logger in generate_data, and everywhere else you just remove __name__ and use the root logger?
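The root-logger variant described here could look roughly as follows: modules keep named loggers (which propagate to the root by default), and generate_data alone wires the caller's configuration onto the root logger. The helper name `apply_logger_config` is hypothetical, used only to sketch the idea:

```python
import logging


def apply_logger_config(logger: logging.Logger) -> None:
    """Hypothetical helper: copy a caller-supplied logger's level and
    handlers onto the root logger, so every named logger in the package
    inherits them through normal propagation."""
    root = logging.getLogger()
    root.setLevel(logger.level)
    for handler in logger.handlers:
        root.addHandler(handler)
```

Under this scheme no module needs to rebind its own logger; any `logging.getLogger("some.module")` call picks up the caller's handlers automatically via propagation to the root.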

@mergify mergify bot added the ci-failure label Dec 11, 2024
@cdoern cdoern added the hold In-progress PR. Tag should be removed before merge. label Dec 11, 2024
@khaledsulayman
Member Author

we found the issue that was preventing the proposed logging configs from propagating down, so this change is not needed

Labels
ci-failure hold In-progress PR. Tag should be removed before merge. one-approval

3 participants