
Huggingface Trainer closes run automatically after training #1663

Open
Ulipenitz opened this issue Feb 26, 2024 · 3 comments

Ulipenitz commented Feb 26, 2024

Is your feature request related to a problem? Please describe.

When I use a Huggingface Trainer with a NeptuneCallback, it seems that the Trainer closes the run automatically after training and thus disconnects it from the Python logger.
If I want to log anything to Neptune after training, I have to reinitialize the run, which makes the code complex in bigger training pipelines.

Describe the solution you'd like

It would be great if the run persisted after training.

Describe alternatives you've considered

My workaround looks like this:

main.py:

import logging
import os

from dotenv import find_dotenv, load_dotenv
import neptune
from neptune.integrations.python_logger import NeptuneHandler
from training_function import training_function

def setup_main_logger(run, run_id):
    logger = logging.getLogger()  # Get the root logger
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    run, neptune_handler = get_neptune_handler(run, run_id, formatter)
    logger.addHandler(neptune_handler)
    return run, logging.getLogger(__name__)

def get_neptune_handler(run, run_id, formatter):
    try:
        # Stop the existing handle (it may already have been stopped by the Trainer)
        run.stop()
    finally:
        # Reconnect to the same run so logging can continue
        run = neptune.init_run(with_id=run_id, capture_stderr=True, capture_stdout=True)
    neptune_handler = NeptuneHandler(run=run)
    neptune_handler.setFormatter(formatter)
    return run, neptune_handler

if __name__ == "__main__":

    # load ENV variables
    load_dotenv(find_dotenv(), override=True)
    NEPTUNE_API_TOKEN = os.environ.get("NEPTUNE_API_TOKEN")
    NEPTUNE_PROJECT = os.environ.get("NEPTUNE_PROJECT")

    # Initialize Neptune run
    run = neptune.init_run(capture_stderr=True, capture_stdout=True)
    run_id = run["sys/id"].fetch()

    # Set up logging
    run, logger = setup_main_logger(run, run_id)
    ...
    logger.info("This logs perfectly to Neptune! ")
    training_function(..., run)
    logger.info("THIS NEVER GETS LOGGED TO NEPTUNE!")
    run, logger = setup_main_logger(run, run_id)
    logger.info("This logs perfectly to Neptune! ")

training_function.py:

from transformers.integrations import NeptuneCallback
from transformers import Trainer
import logging

logger = logging.getLogger()  # root logger

def training_function(..., run) -> None:
    ...
    # Create neptune callback for training logs
    neptune_callback = NeptuneCallback(
        run=run,
        log_parameters=True,
        log_checkpoints="all",
    )
    
    logger.info("This logs perfectly to Neptune! ")
    # Initialize the trainer using our model, training args & dataset, and train
    trainer = Trainer(
        model=model,
        args=args,
        ...
        callbacks=[neptune_callback],
    )
    logger.info("This logs perfectly to Neptune! ")
    trainer.train()
    logger.info("THIS NEVER GETS LOGGED TO NEPTUNE!")
SiddhantSadangi self-assigned this Feb 26, 2024
SiddhantSadangi (Member) commented Feb 26, 2024

Hey @Ulipenitz 👋
Neptune does indeed automatically stop the run once the training loop is done. However, we do provide multiple options to log additional metadata to the run once training is over.
Here is our Transformers integration guide that lists these options 👉 https://docs.neptune.ai/integrations/transformers/#logging-additional-metadata-after-training
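
For example, you can fetch the run back from the callback and log to it directly once training is done. A minimal sketch (the metric key here is made up just for illustration):

from transformers.integrations import NeptuneCallback

trainer.train()

# Retrieve the run object that the NeptuneCallback used during training
run = NeptuneCallback.get_run(trainer)

# Log whatever post-training metadata you need (hypothetical key and value)
run["evaluation/test_accuracy"] = 0.91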

Please let me know if any of these work for you 🤗

SiddhantSadangi added the pending (Waiting for a response) label Feb 26, 2024
Ulipenitz (Author) commented:

Thanks for the answer @SiddhantSadangi!
This is indeed useful for logging metadata like test metrics after training.
My problem, though, is that I need to set up the Python logger again after the training function.
I am training on a remote machine in the cloud, and unfortunately capture_stderr=True, capture_stdout=True only captures Neptune-specific logs, but I want to have all logs in Neptune, including the Python logger output.
My proposed workaround of calling setup_main_logger works, but I don't think it is a nice solution.

SiddhantSadangi added the feature request label and removed the pending (Waiting for a response) label Feb 27, 2024
SiddhantSadangi (Member) commented Feb 27, 2024

Ah, understood!
Yes, this is definitely inconvenient.

I think your workaround handles this pretty well in the absence of official support for this use case. I'll just suggest using neptune_callback's get_run() method to access the run used by the Transformers callback. This removes the need to store the run_id and reinitialize the run.

from neptune.integrations.python_logger import NeptuneHandler

trainer = Trainer(
    ...
    callbacks=[neptune_callback],
)

logger.info("This will be logged to Neptune")

trainer.train()

# The callback has stopped the run at this point, so the handler is disconnected
logger.info("This won't be logged to Neptune")

# Fetch the run back from the callback and reattach the Python logging handler
run = neptune_callback.get_run(trainer)
neptune_handler = NeptuneHandler(run=run)
logger.addHandler(neptune_handler)
logger.info("This will be logged to Neptune")

Please let me know if this workaround works better for you 🙏

I will also pass this feedback to the product team ✅
