Resolve OOM when reading large logs in webserver #45079

Open
2 of 11 tasks
jason810496 opened this issue Dec 19, 2024 · 10 comments

jason810496 (Contributor) commented Dec 19, 2024

Description

Related context: #44753 (comment)

TL;DR

After some research and a proof of concept (POC), I would like to propose a potential solution. However, this solution requires changes to airflow.utils.log.file_task_handler.FileTaskHandler, so if it is accepted, it will also require modifications to the 10 providers that extend the FileTaskHandler class.

Main Concept for Refactoring

The proposed solution focuses on:

  1. Returning a generator instead of loading the entire file content at once.
  2. Leveraging a heap to merge logs incrementally, rather than sorting entire chunks.

The POC for this refactoring shows a 90% reduction in memory usage with similar processing times!

Experiment Details

  • Log size: 830 MB
  • Approximately 8,670,000 lines

Main Root Causes of OOM

  1. _interleave_logs function in airflow.utils.log.file_task_handler
    • Extends all log strings into the records list.
    • Sorts the entire records list.
    • Yields lines with deduplication.
  2. _read method in airflow.utils.log.file_task_handler.FileTaskHandler
    • Joins all aggregated logs into a single string using:
      "\n".join(_interleave_logs(all_log_sources))
  3. Methods that use _read:
    These methods read the entire log content and return it as a string instead of a generator:
    • _read_from_local
    • _read_from_logs_server
    • _read_remote_logs (implemented by providers)
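
For illustration only, here is a simplified sketch of the eager pattern described above (hypothetical code, not the actual Airflow implementation): every source is fully materialized, all lines are collected into one list, sorted, and joined into a single string, so peak memory grows with the total size of all logs.

    from typing import Iterable


    def interleave_logs_eager(log_sources: Iterable[str]) -> str:
        """Simplified sketch of the current, memory-hungry pattern."""
        records: list[str] = []
        for source in log_sources:
            # each `source` is an entire log file already loaded as one string
            records.extend(source.splitlines())
        records.sort()  # sorts all ~8.7M lines in memory at once
        return "\n".join(records)  # builds one more full copy of the data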

Proposed Refactoring Solution

The main concept includes:

  • Return a generator for reading each log source (local or external) instead of the whole file content as a string.
  • Merge logs using a k-way merge instead of sorting:
    • Since each log source is already sorted, merge the streams incrementally using heapq.
    • Return a stream of the merged result (see the sketch below).
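
A minimal sketch of the generator-based approach, assuming each log source can be exposed as an iterator of lines that are already sorted (the function names below are illustrative, not the actual Airflow API):

    import heapq
    from typing import Iterable, Iterator


    def read_local_log_stream(path: str) -> Iterator[str]:
        """Yield lines lazily instead of returning the whole file as one string."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")


    def interleave_log_streams(streams: Iterable[Iterator[str]]) -> Iterator[str]:
        """Merge already-sorted line streams incrementally with a heap.

        heapq.merge keeps only one pending line per stream in memory, so peak
        memory no longer grows with the total log size.
        """
        last = None
        for line in heapq.merge(*streams):
            if line != last:  # simple adjacent-line deduplication
                yield line
            last = line

Because each stream is consumed lazily, the webserver can start sending merged log lines to the client as soon as the first lines are available, instead of only after a full in-memory sort.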

Breaking Changes in This Solution

  1. Interface of the read Method in FileTaskHandler:

    • Will now return a generator instead of a string.
  2. Interfaces of read_log_chunks and read_log_stream in TaskLogReader:

    • Adjustments to support the generator-based approach.
  3. Methods That Use _read

    • _read_from_local
    • _read_from_logs_server
    • _read_remote_logs (10 providers implement this method)
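
For illustration, one hypothetical shape for the generator-based read path (none of these class or method signatures are final; they only sketch how the pieces could fit together):

    from typing import Iterator, Optional, Tuple


    class StreamingFileTaskHandler:
        """Hypothetical sketch, not the current FileTaskHandler interface."""

        def _read(self, ti, try_number: int, metadata: Optional[dict] = None) -> Tuple[Iterator[str], dict]:
            # Each source (local file, logs server, remote storage) would be exposed
            # as an iterator of lines, and the handler would merge them lazily.
            raise NotImplementedError

        def read(self, ti, try_number: int, metadata: Optional[dict] = None) -> Tuple[Iterator[str], dict]:
            # Previously returned the fully materialized log text; a streaming
            # variant would instead return an iterator plus per-read metadata.
            return self._read(ti, try_number, metadata)


    class StreamingTaskLogReader:
        """Hypothetical sketch of a generator-friendly read_log_stream."""

        def __init__(self, log_handler: StreamingFileTaskHandler) -> None:
            self.log_handler = log_handler

        def read_log_stream(self, ti, try_number: int, metadata: dict) -> Iterator[str]:
            stream, metadata = self.log_handler.read(ti, try_number, metadata)
            for line in stream:
                yield line + "\n"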

Experimental Environment:

  • Setup: Docker Compose without memory limits.
  • Memory Profiling: memray
  • Log size: 830 MB, approximately 8,670,000 lines
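
The exact profiling harness is not shown in the issue; the following is just one plausible way to capture such a profile with memray's Python API (the read_log_stream call and its arguments are assumptions):

    import memray

    from airflow.utils.log.log_reader import TaskLogReader


    def profile_log_read(ti, try_number: int) -> None:
        """Capture an allocation profile while consuming a task's log stream."""
        reader = TaskLogReader()
        with memray.Tracker("read_large_log.bin"):
            for _chunk in reader.read_log_stream(ti, try_number, metadata={}):
                pass  # consume the stream; only peak memory matters here

    # Inspect the result afterwards with: memray flamegraph read_large_log.bin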

Benchmark Metrics

Summary

Feel free to share any feedback! I believe we should have more discussions before adopting this solution, as it involves breaking changes to the FileTaskHandler interface and requires refactoring in 10 providers as well.

Related issues

#44753

TODO Tasks

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

jason810496 added the kind:feature and needs-triage labels Dec 19, 2024

boring-cyborg bot commented Dec 19, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

potiuk (Member) commented Dec 19, 2024

Yes. That's exactly how I envisioned solving this problem. @dstandish ?

potiuk (Member) commented Dec 19, 2024

FYI: breaking changes to FileTaskHandler are not a problem - we can work out backwards compatibility or simply break it for Airflow 3. This is not a big deal, since it only affects deployment configuration and does not require DAG adaptations.

potiuk removed the needs-triage label Dec 19, 2024
jason810496 (Contributor, Author) commented:

Hi @potiuk,

Would it be okay if I treat this issue as an umbrella issue to track the remaining TODO tasks while refactoring each provider? Or would it be preferable to refactor FileTaskHandler and all providers in a single PR? Thanks!

potiuk (Member) commented Dec 19, 2024

Sure. It can be a separate set of PRs and this issue can remain the "umbrella" - you do not need to open more issues. PRs are enough.

dstandish (Contributor) commented:

Yes. That's exactly how I envisioned solving this problem. @dstandish ?

IIRC this should be fine when the task is done, but it may present challenges while the task is in flight, because at any moment the location of the logs may shift, e.g. from worker to remote storage.

potiuk (Member) commented Dec 19, 2024

IIRC this should be fine when the task is done, but it may present challenges while the task is in flight, because at any moment the location of the logs may shift, e.g. from worker to remote storage.

Is it not the same case now?

tirkarthi (Contributor) commented:

Related issue #31105

jason810496 (Contributor, Author) commented Dec 20, 2024

Yes. That's exactly how I envisioned solving this problem. @dstandish ?

IIRC this should be fine when the task is done, but it may present challenges while the task is in flight, because at any moment the location of the logs may shift, e.g. from worker to remote storage.

Taking S3TaskHandler as an example, it requires additional refactoring and might need a read_stream method added to S3Hook that returns a generator-based result:
https://github.com/apache/airflow/blob/main/providers/src/airflow/providers/amazon/aws/log/s3_task_handler.py#L136-L192

From my perspective, for the s3_write case, I would download the old log to a temporary file, append the new log stream to that file, and then use the upload_file method to upload it, to prevent memory starvation while producing the same result.
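
A rough sketch of that temporary-file approach (illustrative only; it uses boto3 directly rather than the actual S3Hook/S3TaskHandler code, and the bucket/key arguments are placeholders):

    import tempfile
    from typing import Iterable

    import boto3
    from botocore.exceptions import ClientError


    def s3_append_log_stream(bucket: str, key: str, new_lines: Iterable[str]) -> None:
        """Append a stream of log lines to an existing S3 key without holding
        the whole log in memory: download to a temp file, append, re-upload
        (S3 objects cannot be appended to in place)."""
        s3 = boto3.client("s3")
        with tempfile.NamedTemporaryFile(suffix=".log") as tmp:
            try:
                s3.download_file(bucket, key, tmp.name)  # existing log, if any
            except ClientError:
                pass  # no previous log for this try; start from an empty file
            # re-open the temp file by name in append mode (POSIX-only sketch)
            with open(tmp.name, "a", encoding="utf-8") as f:
                for line in new_lines:
                    f.write(line + "\n")
            s3.upload_file(tmp.name, bucket, key)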

potiuk (Member) commented Dec 20, 2024

Taking S3TaskHandler as an example, it requires additional refactoring and might need a read_stream method added to S3Hook that returns a generator-based result:
https://github.com/apache/airflow/blob/main/providers/src/airflow/providers/amazon/aws/log/s3_task_handler.py#L136-L192

From my perspective, for the s3_write case, I would download the old log to a temporary file, append the new log stream to that file, and then use the upload_file method to upload it, to prevent memory starvation while producing the same result.

Yep. There will be edge cases like that. And yes, the proposed method is good.
