
Conversation

amahussein
Collaborator

Signed-off-by: Ahmed Hussein (amahussein) [email protected]

Contributes to #1836

  • Uses the tools-API in the predict code
  • Introduces a new CSV Combiner class that automatically combines the CSV reports generated by multiple apps into a single DataFrame (conceptually sketched below)
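A rough illustration of what the combining step does, for reviewers skimming the PR. This is a sketch only, not the actual implementation: the per-app result object, its success flag, and especially the `data` accessor are assumptions based on the reviewed diff.

```python
import pandas as pd


def combine_csv_reports(app_results: dict, col_names=None) -> pd.DataFrame:
    """Simplified illustration: read one CSV report per app, optionally patch
    identifying columns (e.g. appId), and concatenate everything into one DF."""
    combined_dfs = []
    for app_id, app_res in app_results.items():
        if not app_res.success:
            # failed apps are skipped by default; a callback can override this
            continue
        app_df = app_res.data  # per-app DataFrame ('data' is an assumed accessor)
        if col_names:
            # stitch the app id into the frame so rows remain attributable after concat
            for col in col_names:
                app_df[col] = app_id
        combined_dfs.append(app_df)
    # guard against an all-failed (or empty) input, since pd.concat([]) raises
    return pd.concat(combined_dfs, ignore_index=True) if combined_dfs else pd.DataFrame()
```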

@amahussein amahussein self-assigned this Aug 4, 2025
@amahussein amahussein added the python Pull requests that update python code label Aug 4, 2025
@github-actions github-actions bot added the user_tools Scope the wrapper module running CSP, QualX, and reports (python) label Aug 4, 2025
@amahussein amahussein force-pushed the rapids-tools-1836-01 branch from 513764b to 104f1fb Compare August 4, 2025 20:00
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
@amahussein amahussein requested a review from Copilot August 4, 2025 20:54
Contributor

@Copilot Copilot AI left a comment
Pull Request Overview

This PR migrates the qualx predict functionality to use the new tools-API instead of the legacy QualCoreHandler. It introduces a new CSV Combiner class that automatically combines CSV reports from multiple applications into a single DataFrame, improving the handling of multi-app qualification results.

  • Uses tools-API v1 components (QualCoreResultHandler, APIResultHandler, CSVReport, CSVCombiner) instead of legacy handlers
  • Introduces CSVCombiner class for automatically combining CSV reports from multiple applications
  • Updates type hints and method signatures to use the new API types

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Summary per file:

  • user_tools/src/spark_rapids_tools/tools/qualx/util.py: Updates imports and function signatures to use new API types, replaces QualCoreHandler with APIResultHandler
  • user_tools/src/spark_rapids_tools/tools/qualx/qualx_main.py: Refactors qualification data processing to use CSVReport and CSVCombiner instead of direct handler access
  • user_tools/src/spark_rapids_tools/tools/qualx/preprocess.py: Removes unnecessary groupby operation and adds TODO comment about WholeStageCodegen filtering
  • user_tools/src/spark_rapids_tools/api_v1/result_handler.py: Adds utility methods for checking if handler is empty and getting reader paths
  • user_tools/src/spark_rapids_tools/api_v1/core/qual_handler.py: Adds method to get raw metrics path from qualification handler
  • user_tools/src/spark_rapids_tools/api_v1/builder.py: Introduces new CSVCombiner class for combining multiple CSV reports
  • user_tools/src/spark_rapids_tools/api_v1/app_handler.py: Enhances patch_into_df method to support custom column names
  • user_tools/src/spark_rapids_pytools/rapids/qualx/prediction.py: Simplifies prediction wrapper to use new API pattern
  • user_tools/src/spark_rapids_pytools/rapids/qualification.py: Updates qualification to use APIResultHandler for building handlers

self._combine_args)
# Q: Should we ignore or skip the empty dataframes?
combined_dfs.append(app_df)
final_df = pd.concat(combined_dfs, ignore_index=True)
Copilot AI Aug 4, 2025

The code will fail with ValueError if combined_dfs is empty. Add a check to handle the case when no DataFrames are available for concatenation.

Suggested change
- final_df = pd.concat(combined_dfs, ignore_index=True)
+ if combined_dfs:
+     final_df = pd.concat(combined_dfs, ignore_index=True)
+ else:
+     final_df = pd.DataFrame()


col_values = [self.app_id]
if col_names is None:
# append attemptId to support multi-attempts
col_names = ['appId']
Copilot AI Aug 4, 2025

The zip operation will only iterate over min(len(col_names), len(col_values)) items. Since col_values always has length 1 but col_names can have different lengths, this could skip columns or fail silently.

Suggested change
- col_names = ['appId']
+ col_names = ['appId']
+ # Ensure col_values matches col_names in length
+ if len(col_values) == 1 and len(col_names) > 1:
+     col_values = col_values * len(col_names)
+ elif len(col_values) != len(col_names):
+     raise ValueError("Length of col_values must be 1 or match length of col_names")


if not q_sum_res.success:
# That should not happen unless there is something terribly wrong because the caller
# has checked that the result is not empty. Therefore, the summary must exist
raise q_sum_res.load_error
Copilot AI Aug 4, 2025

The error message should be more descriptive. Consider providing context about what failed to load (qualification summary) and the original error details.

Suggested change
- raise q_sum_res.load_error
+ raise RuntimeError("Failed to load qualification summary") from q_sum_res.load_error


.build() # combine the results
)
if not q_execs_res.success:
raise q_execs_res.load_error
Copilot AI Aug 4, 2025

The error message should be more descriptive. Consider providing context about what failed to load (execution CSV report) and the original error details.

Suggested change
- raise q_execs_res.load_error
+ raise RuntimeError(
+     f"Failed to load execution CSV report: {q_execs_res.load_error}"
+ ) from q_execs_res.load_error


:return: the resulting dataframe from adding the columns.
"""
# TODO: We should consider add UUID as well, and use that for the joins instead.
# append attempt_id to support multiple attempts
col_values = [self.app_id]
Collaborator

This feels weird that col_values is a fixed/default list of one value, while col_names can be a user-provided list with multiple values. Also, I'm not sure how I would patch app_name per the docstring. And what would happen if I passed in col_names=['one', 'two', 'three']? Seems like I'd get three columns with the same (appId) value?

Collaborator Author

Good point!
I believe that a dev who intends to stitch DFs using apps (the default combiner) has to use a column name that can be mapped to a field on the app object; otherwise they have to use a custom combiner. For example (a mapping sketch follows the list):

  • [appId, app_id, App ID, App Id, ...] -> all those variations can be reduced to the app._app_id field; otherwise there is no way to tell which column name should receive the value of app._app_id
  • [app_name, appName, App Name, ...] -> similarly, all of them can be mapped to the app._app_name field
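One way to express that mapping is sketched below; the helper name and the normalization table are illustrative only and not part of this PR:

```python
import re
from typing import Optional

# Hypothetical normalization table: common column-name spellings -> app field.
_COL_TO_APP_FIELD = {
    'appid': 'app_id',      # appId, app_id, App ID, App Id, ...
    'appname': 'app_name',  # appName, app_name, App Name, ...
}


def resolve_app_field(col_name: str) -> Optional[str]:
    """Return the app attribute that should populate the given column, if any."""
    key = re.sub(r'[^a-z]', '', col_name.lower())
    return _COL_TO_APP_FIELD.get(key)


# resolve_app_field('App ID') -> 'app_id'
# resolve_app_field('appName') -> 'app_name'
# resolve_app_field('stageId') -> None, i.e. a custom combiner would be needed
```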

# patch the app_uuid and if the columns are defined.
return app_entry.patch_into_df(df, col_names=col_names)
return df

Collaborator

Should probably add a default_failed_app_processor which just logs something like "failed to combine: {app_id}". This would provide a useful example (and indirect documentation) for the callback API, e.g. the str argument will be the appId.

Collaborator Author

Good suggestion.
We could also consider dumping them into a single log message to improve the readability of the logs, right?
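A minimal sketch of such a default processor; the function names and the single-message aggregation are assumptions, not code from this PR:

```python
import logging

logger = logging.getLogger(__name__)


def default_failed_app_processor(app_id: str, app_res) -> None:
    """Default callback for apps whose CSV report could not be combined."""
    logger.warning('failed to combine: %s (%s)', app_id, getattr(app_res, 'load_error', None))


def log_failed_apps(failed_app_ids: list) -> None:
    """Alternative: collect the failed appIds and emit one readable log message."""
    if failed_app_ids:
        logger.warning('failed to combine %d apps: %s',
                       len(failed_app_ids), ', '.join(failed_app_ids))
```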

raise ValueError(f'App entry not found for ID: {app_id}')
if combine_args:
# check if the col_names are provided to stitch the app_ids
col_names = combine_args.get('col_names', None)
Collaborator

Wonder if we should rename col_names here, since it sounds like a "selection" list, but it's actually a "patch" list.

Collaborator Author

good point.

if self._combine_args and 'use_cols' in self._combine_args:
# make sure that we insert the columns to the empty dataframe
injected_cols = pd.DataFrame(columns=self._combine_args['use_cols'])
return pd.concat([injected_cols, empty_df], axis=1)
Collaborator

Should this also add columns from 'col_names' as well (or is that a subset of 'use_cols')?

Collaborator Author

Yes, this was a typo; it should be col_names.
However, this reminded me that the empty DF should also account for the case where the CSVReport had arguments to rename columns.
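A sketch of how the empty-result path could account for both the patched columns and a rename mapping; the argument names here are assumptions based on the snippet above:

```python
import pandas as pd


def build_empty_result(combine_args: dict, rename_map: dict = None) -> pd.DataFrame:
    """Build an empty DataFrame whose columns match what a non-empty result would have."""
    cols = list(dict.fromkeys(
        list(combine_args.get('col_names', [])) +   # columns patched per app (e.g. appId)
        list(combine_args.get('use_cols', []))))    # columns selected from the CSV report
    empty_df = pd.DataFrame(columns=cols)
    if rename_map:
        # apply the same column renames that the CSVReport would have applied
        empty_df = empty_df.rename(columns=rename_map)
    return empty_df
```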

cb_fn: Callable[[ToolResultHandlerT, str, pd.DataFrame, dict], pd.DataFrame]) -> 'CSVCombiner':
"""Set the processor for successful applications."""
self._success_app_processor = cb_fn
return self
Collaborator

For consistency, I'd settle on either processor or cb_fn arg for both of these, unless there was a good reason for this distinction.

Collaborator Author

Fair point.

self._failed_app_processor(app_id, app_res)
else:
# default behavior is to skip the app
continue
Collaborator

If we have a default failed_app_processor, then we can simplify this block.

Collaborator

Also, can failed_app_processor raise an exception? Might want to document expected behavior.

Collaborator Author

Great attention to detail +1
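For instance, the combining loop could enforce one documented contract, namely that a raising callback skips only the offending app. A sketch under assumed result attributes, not the code in this PR:

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def combine_with_guarded_callback(app_results: dict, failed_app_processor) -> pd.DataFrame:
    """Illustrates one possible contract: if failed_app_processor raises, the app is
    skipped and the combine run continues."""
    combined_dfs = []
    for app_id, app_res in app_results.items():
        if app_res.success:
            combined_dfs.append(app_res.data)  # 'data' is an assumed accessor
            continue
        try:
            failed_app_processor(app_id, app_res)  # user-provided or default callback
        except Exception as ex:
            logger.warning('failed_app_processor raised for %s: %s; skipping app', app_id, ex)
    return pd.concat(combined_dfs, ignore_index=True) if combined_dfs else pd.DataFrame()
```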

Collaborator Author

@amahussein amahussein left a comment

Thanks @leewyang for the feedback.


# See the License for the specific language governing permissions and
# limitations under the License.

"""Utility functions for featurizer for QualX"""
Collaborator

I'd rename this, since featurizers will be the folder containing pluggable code, e.g. I think most of this code was in featurizers/default.py (i.e. the default featurizer for qualx)

Collaborator Author

> I'd rename this, since featurizers will be the folder containing pluggable code, e.g. I think most of this code was in featurizers/default.py (i.e. the default featurizer for qualx)

Absolutely, this is just a temporary place to keep the shadow implementations together. Eventually, the new code will migrate to the original qualX packages.

:param arg: The DataFrame to check.
:param allow_empty: If True, allow the DataFrame to be empty.
:return: True if the DataFrame is valid, False otherwise.
"""
Collaborator

This seems like it should be pulled out as an API that other plugin writers can leverage (if we want this to be the desired usage pattern).

Collaborator Author

Mmm, OK. I was just wondering whether such a simple method needs to live in a common place.
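If it were pulled into a shared location, the helper could be as small as the sketch below (matching the docstring quoted above; where it should live is the open question):

```python
import pandas as pd


def is_valid_df(arg, allow_empty: bool = False) -> bool:
    """Return True if arg is a DataFrame and, unless allow_empty is set, non-empty."""
    if not isinstance(arg, pd.DataFrame):
        return False
    return allow_empty or not arg.empty
```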

# Step-a: combine some of the columns.
# Step-b: return the processed DataFrame with the features.

app_info_df = raw_tbls.reports.get('coreRawApplicationInformationCSV')
Collaborator

Minor nit: it would be nice if we had an enum or something that showed the possible reports, s.t. we can use IDE auto-completion. Right now, I'm not sure where I'd find the definitive list.

Collaborator Author

> Minor nit: it would be nice if we had an enum or something that showed the possible reports, s.t. we can use IDE auto-completion. Right now, I'm not sure where I'd find the definitive list.

It is doable; it might just need some extra work. We can certainly do that once things are ready.

There should be an auto-generated markdown that lists all labels and their fields.
For now, there are 4 YAML files that contain those new labels: https://github.com/NVIDIA/spark-rapids-tools/blob/4f9abf038d7d0aa48bf0594f68694e21a38e0a19/core/src/main/resources/configs/reports/coreRawMetricsReport.yaml
When opened in the IDE, the files are copied into the wrapper resources folder:
user_tools/src/spark_rapids_pytools/resources/generated_files/core/reports
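A hedged sketch of what such an enum could look like, using only the two labels that appear in this PR; the enum name and anything beyond those members are assumptions (ideally it would be generated from the YAML files above):

```python
from enum import Enum


class CoreRawReport(Enum):
    """Illustrative enum of report labels, enabling IDE auto-completion.
    Ideally this would be generated from the report YAML files."""
    APP_INFORMATION = 'coreRawApplicationInformationCSV'
    SQL_DURATION_AND_EXECUTOR_CPU_TIME_PERCENT = 'coreRawSqlDurationAndExecutorCpuTimePercentCSV'


# usage: raw_tbls.reports.get(CoreRawReport.APP_INFORMATION.value)
```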


# Allow user-provided 'ds_name' as 'appName'
# TODO: Do we still need the line below?
app_info_df['appName'] = raw_tbls.ds_name
Collaborator

I think this is still needed (given assumptions in the code), but I'd like to eventually just inject a separate ds_name column, instead of overwriting the existing appName.

Collaborator Author

> I think this is still needed (given assumptions in the code), but I'd like to eventually just inject a separate ds_name column, instead of overwriting the existing appName.

I like the idea of injecting a ds_name column. The original code confused me because I was not sure why we would overwrite the app_name and lose that information.

Another question: the original code was injecting app_name into most of the feature tables. I assume you want to inject ds_name now; do we still need it in all the other feature tables too?
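The suggested change amounts to something like this small sketch (the ds_name column name comes from the discussion above; nothing else is implied):

```python
import pandas as pd


def add_ds_name(app_info_df: pd.DataFrame, ds_name: str) -> pd.DataFrame:
    """Inject the dataset name as its own column instead of overwriting appName."""
    out_df = app_info_df.copy()
    out_df['ds_name'] = ds_name  # new column carrying the user-provided dataset name
    # 'appName' is left untouched, so the original application name is preserved
    return out_df
```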

################################################
# coreRawSqlDurationAndExecutorCpuTimePercentCSV
################################################
# TODO: legacy code overwrite the SQLDuration with the AppDuration. It is not clear why we do so?
Collaborator

I believe it was due to this and this.

So, it's not replacing the SQLDuration, but patching the AppDuration column of the sql_duration_and_executor_cpu_time_percent.csv.

Collaborator Author

> I believe it was due to this and this.

I believe this was old behavior; currently the raw metrics contain the appDuration.
I looked at the output of some "inprogress" eventlogs, and I found that the appDuration is set in all of them.

# This will raise a keyError if app_meta is not defined.
app_meta = ds_meta['app_meta']
# get the array of resultHandlers from the dataset metadata.
core_res_handlers = ds_meta['resultHandlers']
Collaborator

Still think that we may not want to inject resultHandlers into the ds_meta, but instead have something like this:

core_res_handlers = get_result_handlers(ds_meta)

Or something that loads all of the handlers and injects them within this function.

The main reason is that load_profiles() is one of the top-level APIs that is often used in notebooks (to load the profile_df), so it'd be nice if the caller didn't have to modify/mutate the datasets before invoking this. Obviously if there is a good reason why it has to be done this way, we can just push the user-friendly API up a level, e.g. to load_datasets(). Basically, the user shouldn't really need to know about resultHandlers when they just want to load the profile_df dataframe.

Collaborator Author

> Still think that we may not want to inject resultHandlers into the ds_meta, but instead have something like this:
>
> core_res_handlers = get_result_handlers(ds_meta)
>
> Or something that loads all of the handlers and injects them within this function.
>
> The main reason is that load_profiles() is one of the top-level APIs that is often used in notebooks (to load the profile_df), so it'd be nice if the caller didn't have to modify/mutate the datasets before invoking this. Obviously if there is a good reason why it has to be done this way, we can just push the user-friendly API up a level, e.g. to load_datasets(). Basically, the user shouldn't really need to know about resultHandlers when they just want to load the profile_df dataframe.

I see! Let's iterate on this offline to define the best way to do it.
As long as we can pass the tools output-dir and which tool it was, I think we should be fine. By passing the output-dir (or a list of output directories), we can then initialize the resultHandlers inside the featurizer (see the sketch below).

> Basically, the user shouldn't really need to know about resultHandlers when they just want to load the profile_df dataframe.

I understand your point, but given that we are pushing the "ToolsAPI" to make it easy for users to consume and process the reports, the user needs to be familiar with it.
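A sketch of the helper discussed above; the 'tools_output_dirs' key and the handler_factory parameter are hypothetical stand-ins for however the tools-API (e.g. APIResultHandler) would build a handler from a tool output directory:

```python
from typing import Callable, List


def get_result_handlers(ds_meta: dict,
                        handler_factory: Callable[[str], object]) -> List[object]:
    """Build the core result handlers from the dataset metadata so that callers of
    load_profiles() do not have to inject 'resultHandlers' themselves."""
    if 'resultHandlers' in ds_meta:
        # keep backward compatibility with callers that already inject handlers
        return ds_meta['resultHandlers']
    output_dirs = ds_meta.get('tools_output_dirs', [])  # hypothetical metadata key
    return [handler_factory(out_dir) for out_dir in output_dirs]
```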

predictions_df = predict(platform=model_name, qual=qual_output_dir,
output_info=output_info,
model=estimation_model_args['customModelFile'],
qual_handlers=[q_core_handler])
Collaborator

We might want to eventually push the branching down into the core qualx APIs, just so all invocations to predict() can use the switch, but this is fine (and easier) for now.

Collaborator Author

@amahussein amahussein left a comment

Thanks @leewyang for the feedback.

