Resolve OOM when reading large logs in webserver #45079
Comments
Yes. That's exactly how I envisioned solving this problem. @dstandish ?
FYI: breaking changes to FileTaskHandler are not a problem. We can work out back-compatibility or simply break it for Airflow 3. This is not a big deal, since it is only a deployment configuration and does not require DAG adaptations.
Hi @potiuk, would it be okay if I treat this issue as an umbrella issue to track other TODO tasks while refactoring each provider? Or would it be preferable to refactor
Sure. It can be a separate set of PRs, and this issue can remain the "umbrella". You do not need more issues; PRs are enough.
IIRC this should be fine when the task is done, but it may present challenges when the task is in flight, because at any moment the location of the logs may shift, e.g. from the worker to remote storage.
Is it not the same case now?
Related issue #31105
Taking From my perspective, for the
Yep. There will be edge cases like that. And yes, the proposed method is good.
Description
Related context: #44753 (comment)
TL;DR
After conducting some research and implementing a POC, I would like to propose a potential solution. However, this solution requires changes to airflow.utils.log.file_task_handler.FileTaskHandler. If the solution is accepted, it will necessitate modifications to the 10 providers that extend the FileTaskHandler class.
Main Concept for Refactoring
The proposed solution focuses on:
The POC for this refactoring shows a 90% reduction in memory usage with similar processing times!
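To give a rough sense of why streaming can reduce memory this much, here is a minimal sketch of merging several log streams lazily with the standard library's heapq.merge instead of collecting every line into a list and sorting it. This is not the POC's code: the function name merge_log_streams and the assumption that every line starts with a 19-character ISO-8601 timestamp are hypothetical, chosen only for illustration. Each input stream is also assumed to be already in timestamp order.

```python
import heapq
from typing import Iterable, Iterator

# Assumption for this sketch: each line begins with a timestamp such as
# "2024-12-19T10:00:00", which sorts chronologically as plain text.
TIMESTAMP_LEN = 19


def merge_log_streams(*streams: Iterable[str]) -> Iterator[str]:
    """Lazily merge already-sorted log streams, skipping consecutive duplicates."""
    last = None
    # heapq.merge keeps only one pending line per stream in memory,
    # so the full set of logs is never materialized at once.
    for line in heapq.merge(*streams, key=lambda text: text[:TIMESTAMP_LEN]):
        if line != last:
            yield line
        last = line
```

Because the result is a generator, a caller can forward lines to the client as they are produced rather than joining them into one large string first.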
Experiment Details
Main Root Causes of OOM
_interleave_logs Function in airflow.utils.log.file_task_handler
[…] records list.
[…] records list.
_read Method in airflow.utils.log.file_task_handler.FileTaskHandler
Methods That Use _read: These methods read the entire log content and return it as a string instead of a generator:
_read_from_local
_read_from_logs_server
_read_remote_logs (implemented by providers)
Proposed Refactoring Solution
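As a small illustration of the string-versus-generator distinction behind the root causes listed above, the sketch below contrasts reading a whole log file into one string with yielding it line by line. The helper names read_whole and read_stream are hypothetical and are not part of FileTaskHandler or any provider; they only show the general pattern.

```python
from pathlib import Path
from typing import Iterator


def read_whole(path: Path) -> str:
    """Load the entire file into memory as a single string."""
    return path.read_text()


def read_stream(path: Path) -> Iterator[str]:
    """Yield one line at a time; only the current line is held in memory."""
    with path.open() as handle:
        for line in handle:
            yield line.rstrip("\n")
```

With read_whole, peak memory grows with the file size; with read_stream, it stays roughly constant as long as the consumer also processes lines incrementally.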
The main concept includes:
[…] heapq with streams of logs.
Breaking Changes in This Solution
Interface of the read Method in FileTaskHandler:
Interfaces of read_log_chunks and read_log_stream in TaskLogReader:
Methods That Use _read
_read_from_local
_read_from_logs_server
_read_remote_logs (10 providers implement this method)
Experimental Environment:
830 MB, about 8,670,000 lines
Benchmark Metrics
Original Implementation:
POC (Refactored Implementation):
Summary
Feel free to share any feedback! I believe we should have more discussion before adopting this solution, as it involves breaking changes to the FileTaskHandler interface and requires refactoring in 10 providers as well.
Related issues
#44753
TODO Tasks
_read_remote_logs
get_task_log
CeleryKubernetesExecutor
KubernetesExecutor
LocalKubernetesExecutor