
Huggingface Trainer closes run automatically after training #1663

Open
Ulipenitz opened this issue Feb 26, 2024 · 3 comments

Ulipenitz commented Feb 26, 2024

Is your feature request related to a problem? Please describe.

When I use a Huggingface Trainer with a NeptuneCallback, it seems that the Trainer closes the run automatically after training and thus disconnects it from the Python logger.
If I want to log anything to Neptune after training, I have to reinitialize the run, which makes the code complex in bigger training pipelines.

Describe the solution you'd like

It would be great if the run persisted after training.

Describe alternatives you've considered

My workaround looks like this:

main.py:

import logging
import os

from dotenv import find_dotenv, load_dotenv
import neptune
from neptune.integrations.python_logger import NeptuneHandler
from training_function import training_function

def setup_main_logger(run, run_id):
    logger = logging.getLogger()  # Get the root logger
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    run, neptune_handler = get_neptune_handler(run, run_id, formatter)
    logger.addHandler(neptune_handler)
    return run, logging.getLogger(__name__)

def get_neptune_handler(run, run_id, formatter):
    try:
        # Stop the existing handle (it may already have been stopped by the Trainer)
        run.stop()
    finally:
        # Reconnect to the same run so logging can continue
        run = neptune.init_run(with_id=run_id, capture_stderr=True, capture_stdout=True)
    neptune_handler = NeptuneHandler(run=run)
    neptune_handler.setFormatter(formatter)
    return run, neptune_handler

if __name__ == "__main__":

    # load ENV variables
    load_dotenv(find_dotenv(), override=True)
    NEPTUNE_API_TOKEN = os.environ.get("NEPTUNE_API_TOKEN")
    NEPTUNE_PROJECT = os.environ.get("NEPTUNE_PROJECT")

    # Initialize Neptune run
    run = neptune.init_run(capture_stderr=True, capture_stdout=True)
    run_id = run["sys/id"].fetch()

    # Set up logging
    run, logger = setup_main_logger(run, run_id)
    ...
    logger.info("This logs perfectly to Neptune! ")
    training_function(..., run)
    logger.info("THIS NEVER GETS LOGGED TO NEPTUNE!")
    run, logger = setup_main_logger(run, run_id)
    logger.info("This logs perfectly to Neptune! ")

training_function.py:

from transformers.integrations import NeptuneCallback
from transformers import Trainer
import logging

logger = logging.getLogger()  # root logger

def training_function(..., run) -> None:
    ...
    # Create neptune callback for training logs
    neptune_callback = NeptuneCallback(
        run=run,
        log_parameters=True,
        log_checkpoints="all",
    )
    
    logger.info("This logs perfectly to Neptune! ")
    # Initialize the trainer using our model, training args & dataset, and train
    trainer = Trainer(
        model=model,
        args=args,
        ...
        callbacks=[neptune_callback],
    )
    logger.info("This logs perfectly to Neptune! ")
    trainer.train()
    logger.info("THIS NEVER GETS LOGGED TO NEPTUNE!")
SiddhantSadangi self-assigned this Feb 26, 2024
SiddhantSadangi (Member) commented Feb 26, 2024

Hey @Ulipenitz 👋
Neptune does indeed automatically stop the run once the training loop is done. However, we do provide multiple options to log additional metadata to the run once training is over.
Here is our Transformers integration guide that lists these options 👉 https://docs.neptune.ai/integrations/transformers/#logging-additional-metadata-after-training
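
For example, you can fetch the run back from the callback and log to it directly once training is done. A minimal sketch (the metric key here is made up just for illustration):

from transformers.integrations import NeptuneCallback

trainer.train()

# Retrieve the run object that the NeptuneCallback used during training
run = NeptuneCallback.get_run(trainer)

# Log whatever post-training metadata you need (hypothetical key and value)
run["evaluation/test_accuracy"] = 0.91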

Please let me know if any of these work for you 🤗

SiddhantSadangi added the pending (Waiting for a response) label Feb 26, 2024
Ulipenitz (Author) commented:

Thanks for the answer @SiddhantSadangi!
This is indeed useful for logging metadata like test metrics after training.
My problem, though, is that I need to set up the Python logger again after the training function.
I am training on a remote machine in the cloud, and unfortunately capture_stderr=True, capture_stdout=True only captures Neptune-specific logs, but I want to have all logs in Neptune, including the Python logger output.
My proposed workaround of calling setup_main_logger works, but I don't think it is a nice solution.

SiddhantSadangi added the feature request label and removed the pending (Waiting for a response) label Feb 27, 2024
SiddhantSadangi (Member) commented Feb 27, 2024

Ah, understood!
Yes, this is definitely inconvenient.

I think your workaround handles this pretty well in the absence of official support for this use case. I'll just suggest using neptune_callback's get_run() method to access the run used by the Transformers callback. This removes the need to store the run_id and reinitialize the run.

from neptune.integrations.python_logger import NeptuneHandler

trainer = Trainer(
    ...
    callbacks=[neptune_callback],
)

logger.info("This will be logged to Neptune")

trainer.train()

# The callback has stopped the run at this point, so the handler is disconnected
logger.info("This won't be logged to Neptune")

# Fetch the run back from the callback and reattach the Python logging handler
run = neptune_callback.get_run(trainer)
neptune_handler = NeptuneHandler(run=run)
logger.addHandler(neptune_handler)
logger.info("This will be logged to Neptune")

Please let me know if this workaround works better for you 🙏

I will also pass this feedback to the product team ✅
