Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cost_Final table update policy fails with out of memory exception. #1188

Closed
RolandKrummenacher opened this issue Dec 5, 2024 · 13 comments · Fixed by #1216
Closed

Cost_Final table update policy fails with out of memory exception. #1188

RolandKrummenacher opened this issue Dec 5, 2024 · 13 comments · Fixed by #1216
Assignees
Labels
Skill: KQL KQL queries and Data Explorer integration Tool: FinOps hubs Data pipeline solution Type: Bug 🐛 Something isn't working

Comments

@RolandKrummenacher
Copy link
Collaborator

🐛 Issue

During data ingestion, the Cost_Final_V1_0 update policy (function Costs_transform_v1_0) fails due to an "out of memory" exception.

👣 Steps to Reproduce

  1. Setup

    • Configure the FinOps Toolkit with the default settings, using the smallest ADX SKU.
  2. Prepare Data

    • Ensure price data has already been ingested.
  3. Trigger Issue

    • Ingest 50 MB of cost data in Snappy-compressed Parquet format.
  4. Verify Ingestion Failure

    • Run the command .show ingestion failures in the ingestion database.
    • Confirm the presence of an "out of memory" exception.
  5. Verify Operation Failure

    • Run the command .show operations in the ingestion database.
    • Confirm that the Costs_transform_v1_0 update policy has failed.

🤔 Expected Outcome

The ingestion process should successfully complete without any errors.

  • No "out of memory" exceptions should occur during the operation.
  • The Costs_transform_v1_0 update policy should execute without failures, ensuring proper data transformation and ingestion.

ℹ️ Additional context

I suspect that the join in the Costs_transform_v1_0 function might be part of the problem. It could potentially be replaced with a lookup. Maybe other KQL optimizations apply such as materializing the price lookup before joining.

🙋‍♀️ Ask for the community

We could use your help:

  1. Please vote this issue up (👍) to prioritize it.
  2. Leave comments to help us solidify the vision.
@RolandKrummenacher RolandKrummenacher added Needs: Triage 🔍 Untriaged issue needs to be reviewed Type: Bug 🐛 Something isn't working labels Dec 5, 2024
@RolandKrummenacher
Copy link
Collaborator Author

Discussed with @flanakin during our debug session today.

@MSBrett
Copy link
Contributor

MSBrett commented Dec 5, 2024

I ran this command to clean the data from the prices table which seems to let the ingestion complete.
.drop extents from Prices_final_v1_0

@RolandKrummenacher
Copy link
Collaborator Author

I'm also suspecting that the parse_resourceid(ResourceId) call is not performant. It looks like this function is executing a subquery for every record.

@mschwit
Copy link

mschwit commented Dec 9, 2024

I'm also getting similar errors, but for different queries. Errors I usually get are like this (even with bigger and not the smallest ADX SKUs):
Expression.Error: Query execution lacks memory resources to complete (80DA0007): Partial query failure: Low memory condition (E_LOW_MEMORY_CONDITION). (message: 'hr: '2161770503' 'Engine under memory pressure, context: sequence datum materialize': ', details: 'StgError { kind: HResult(2161770503, "Engine under memory pressure, context: sequence datum materialize"), source: None }').
[0]Kusto.Data.Exceptions.KustoServicePartialQueryFailureLowMemoryConditionException: Query execution lacks memory resources to complete (80DA0007): Partial query failure: Low memory condition (E_LOW_MEMORY_CONDITION). (message: 'hr: '2161770503' 'Engine under memory pressure, context: sequence datum materialize': ', details: 'StgError { kind: HResult(2161770503, "Engine under memory pressure, context: sequence datum materialize"), source: None }').

@RolandKrummenacher
Copy link
Collaborator Author

I'm also getting similar errors, but for different queries. Errors I usually get are like this (even with bigger and not the smallest ADX SKUs): Expression.Error: Query execution lacks memory resources to complete (80DA0007): Partial query failure: Low memory condition (E_LOW_MEMORY_CONDITION). (message: 'hr: '2161770503' 'Engine under memory pressure, context: sequence datum materialize': ', details: 'StgError { kind: HResult(2161770503, "Engine under memory pressure, context: sequence datum materialize"), source: None }'). [0]Kusto.Data.Exceptions.KustoServicePartialQueryFailureLowMemoryConditionException: Query execution lacks memory resources to complete (80DA0007): Partial query failure: Low memory condition (E_LOW_MEMORY_CONDITION). (message: 'hr: '2161770503' 'Engine under memory pressure, context: sequence datum materialize': ', details: 'StgError { kind: HResult(2161770503, "Engine under memory pressure, context: sequence datum materialize"), source: None }').

Do you face this error for your custom queries or for queries of the finops toolkit?

@mschwit
Copy link

mschwit commented Dec 9, 2024

I get this when using the Power BI reports https://github.com/microsoft/finops-toolkit/blob/dev/src/power-bi/kql/CostSummary.pbip and https://github.com/microsoft/finops-toolkit/blob/dev/src/power-bi/kql/RateOptimization.pbip

Even with a cluster of 128 GB Memory, I e.g. get this error:

Query execution has exceeded the allowed limits (80DA0001): 'summarize' operator has exceeded the memory budget (5368709120) during evaluation. Results may be incorrect or incomplete (E_RUNAWAY_QUERY; see https://aka.ms/kustoquerylimits).. [0]Kusto.Data.Exceptions.KustoServicePartialQueryFailureLimitsExceededException: Query execution has exceeded the allowed limits (80DA0001): 'summarize' operator has exceeded the memory budget (5368709120) during evaluation. Results may be incorrect or incomplete (E_RUNAWAY_QUERY; see https://aka.ms/kustoquerylimits).. Timestamp=2024-12-09T13:36:49.8502403Z ClientRequestId=KPBI;81e77b52-ff9a-44f7-a72a-df89a26f30f3;28f52463-ef89-4443-b8b9-a7b1747e05b4;d7b4848d-1f8f-44e2-ba8f-ed785386a4cb ActivityId=90f5e9a6-b259-4e53-bfe4-49207348f03d ActivityType=GW.Http.CallContext ServiceAlias=FINOPSTESTMCB MachineName=KEngine000000 ProcessName=Kusto.WinSvc.Svc ProcessId=6836 ThreadId=7808 ActivityStack=(Activity stack: CRID=KPBI;81e77b52-ff9a-44f7-a72a-df89a26f30f3;28f52463-ef89-4443-b8b9-a7b1747e05b4;d7b4848d-1f8f-44e2-ba8f-ed785386a4cb ARID=90f5e9a6-b259-4e53-bfe4-49207348f03d > GW.Http.CallContext/90f5e9a6-b259-4e53-bfe4-49207348f03d) MonitoredActivityContext=(ActivityType=GW.Http.CallContext, Timestamp=2024-12-09T13:35:54.0927543Z, ParentActivityId=90f5e9a6-b259-4e53-bfe4-49207348f03d, TimeSinceStarted=55757.486 [ms])ErrorCode= ErrorReason= ErrorMessage= DataSource= DatabaseName= ClientRequestId= ActivityId=00000000-0000-0000-0000-000000000000 UnderlyingErrorCode=80DA0001 UnderlyingErrorMessage='summarize' operator has exceeded the memory budget (5368709120) during evaluation. Results may be incorrect or incomplete (E_RUNAWAY_QUERY; see https://aka.ms/kustoquerylimits). . The exception was raised by the IDbCommand interface. Table: HybridBenefitCosts.

@RolandKrummenacher
Copy link
Collaborator Author

n with a cluster of 128 GB Memory, I e.g. get this error:

Although it's a similar error, I think it's not directly related to this issue, since I have the issue at ingestion time, not at query time.
However, are you running the reports with daily or monthly granularity? Have you tried to limit the powerbi report to a 1 month time range?

@mschwit
Copy link

mschwit commented Dec 9, 2024

I tried to limit the Report only to 2 days with a montly scope. But it seems, that the query first builds the result set and the filters after that

@flanakin flanakin added Tool: FinOps hubs Data pipeline solution Skill: KQL KQL queries and Data Explorer integration and removed Needs: Triage 🔍 Untriaged issue needs to be reviewed labels Dec 10, 2024
@flanakin flanakin added this to the 2024-12 - December milestone Dec 10, 2024
@flanakin
Copy link
Collaborator

I'm also suspecting that the parse_resourceid(ResourceId) call is not performant. It looks like this function is executing a subquery for every record.

I was able to ingest data from our test account without a problem. The largest snappy parquet file was 11.7 MB. That said, I also suspect the parse_resourceid() function. There's a join to look up a resource type display name for the ResourceType column.

I enabled 7-day retention for the raw data and ingested 13 months of data. I ran the Costs_transform_v1_0() function, which is what the update policy uses with and without the join and compared the following query metrics:

Metric Before After Diff Percent
Query time 98.188s 18.984s -79.204s -80.6%
CPU time 45.391s 33.047s -12.391s -27.3%
Data scanned 199.1 MB 203.5 MB +4.4 MB +2.21%
Peak memory 6.9 GB 985.7 MB -5.9 GB -86.087%
Data size 72.1 MB 72.2 MB +82.3 B +0.11%
Row count 39,409 39,250 -159 -0.40%

I'll work on removing or replacing this. I have a thought on how we can achieve this with a slightly different approach.

@flanakin
Copy link
Collaborator

Right now, the parse_resourceid() function runs for every row, but it's not needed for most. If I move it to only be called when needed (keeping the join), the performance is even better, but the memory is a little higher than it was without the join.

Performance

  • Query time: 00:06.821 (6.821 seconds)
  • Total CPU time: 00:05.609 (5.609 seconds)
  • Data scanned (estimated): 212.4 MB (222,742,533 bytes)
  • Peak memory: 1.1 GB (1,171,552,608 bytes)

Result

  • Data size: 70.1 MB (73,521,723 bytes)
  • Rows count: 41,292

@flanakin
Copy link
Collaborator

Okay, adding on to the last 2 changes:

  1. Remove the ResourceTypes table join from the parse_resourceid() function
  2. Only call the parse_resourceid() function when needed (and not for every row)
  3. Create a new resource_type() function that does an inline mapping for resource type display names without doing a join
  4. Replace the parse_resourceid() call with resource_type() only when needed
  5. Fixed a bug where x_ResourceType is being set incorrectly in 0.7

Query time is a tad faster, but the memory hasn't changed since the last test. This is likely because the lookup isn't happening much. I'll submit a few PRs covering these changes.

Performance

  • Query time: 00:06.421 (6.421 seconds)
  • Total CPU time: 00:05.281 (5.281 seconds)
  • Data scanned (estimated): 212.5 MB (222,831,681 bytes)
  • Peak memory: 1.1 GB (1,177,547,424 bytes)

Result

  • Data size: 70.1 MB (73,460,816 bytes)
  • Rows count: 40,860

@flanakin
Copy link
Collaborator

@mschwit Can you submit a separate issue for the problem you're running into? I have 3 optimizations that I want to look into after the ingestion memory problem is resolved. Having a separate issue with your specific details will help ensure that doesn't get missed.

@mschwit
Copy link

mschwit commented Dec 16, 2024

Dear @flanakin, I created an additional issue at #1220

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Skill: KQL KQL queries and Data Explorer integration Tool: FinOps hubs Data pipeline solution Type: Bug 🐛 Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants