Use tools-API in qualx predict #1838
base: dev
Conversation
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Contributes to NVIDIA#1836
- uses tools-API in the predict code
- introduces a new CSV Combiner class that automatically combines the CSV reports generated by multiple apps into a single DF
513764b to 104f1fb
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Pull Request Overview
This PR migrates the qualx predict functionality to use the new tools-API instead of the legacy QualCoreHandler. It introduces a new CSV Combiner class that automatically combines CSV reports from multiple applications into a single DataFrame, improving the handling of multi-app qualification results.
- Uses tools-API v1 components (QualCoreResultHandler, APIResultHandler, CSVReport, CSVCombiner) instead of legacy handlers
- Introduces CSVCombiner class for automatically combining CSV reports from multiple applications
- Updates type hints and method signatures to use the new API types
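For orientation, here is a rough sketch of what combining per-app CSV reports into one DataFrame amounts to in pandas terms. The `csv_paths_by_app` input and the helper name are hypothetical illustrations, not the actual CSVCombiner API introduced by this PR.

```python
import pandas as pd

def combine_app_csvs(csv_paths_by_app: dict) -> pd.DataFrame:
    """Hypothetical sketch: read one CSV per app, tag each row with its appId,
    and concatenate everything into a single DataFrame."""
    combined_dfs = []
    for app_id, csv_path in csv_paths_by_app.items():
        app_df = pd.read_csv(csv_path)
        app_df['appId'] = app_id  # patch the app identifier into every row
        combined_dfs.append(app_df)
    # guard the no-apps case so pd.concat does not raise ValueError
    return pd.concat(combined_dfs, ignore_index=True) if combined_dfs else pd.DataFrame()
```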
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
File | Description
---|---
user_tools/src/spark_rapids_tools/tools/qualx/util.py | Updates imports and function signatures to use new API types, replaces QualCoreHandler with APIResultHandler
user_tools/src/spark_rapids_tools/tools/qualx/qualx_main.py | Refactors qualification data processing to use CSVReport and CSVCombiner instead of direct handler access
user_tools/src/spark_rapids_tools/tools/qualx/preprocess.py | Removes unnecessary groupby operation and adds TODO comment about WholeStageCodegen filtering
user_tools/src/spark_rapids_tools/api_v1/result_handler.py | Adds utility methods for checking if handler is empty and getting reader paths
user_tools/src/spark_rapids_tools/api_v1/core/qual_handler.py | Adds method to get raw metrics path from qualification handler
user_tools/src/spark_rapids_tools/api_v1/builder.py | Introduces new CSVCombiner class for combining multiple CSV reports
user_tools/src/spark_rapids_tools/api_v1/app_handler.py | Enhances patch_into_df method to support custom column names
user_tools/src/spark_rapids_pytools/rapids/qualx/prediction.py | Simplifies prediction wrapper to use new API pattern
user_tools/src/spark_rapids_pytools/rapids/qualification.py | Updates qualification to use APIResultHandler for building handlers
    self._combine_args)
# Q: Should we ignore or skip the empty dataframes?
combined_dfs.append(app_df)
final_df = pd.concat(combined_dfs, ignore_index=True)
The code will fail with ValueError if combined_dfs is empty. Add a check to handle the case when no DataFrames are available for concatenation.
Suggested change:
-final_df = pd.concat(combined_dfs, ignore_index=True)
+if combined_dfs:
+    final_df = pd.concat(combined_dfs, ignore_index=True)
+else:
+    final_df = pd.DataFrame()
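For reference, the failure mode being flagged is plain pandas behavior, independent of this PR:

```python
import pandas as pd

try:
    pd.concat([], ignore_index=True)
except ValueError as err:
    print(err)  # "No objects to concatenate"
```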
col_values = [self.app_id]
if col_names is None:
    # append attemptId to support multi-attempts
    col_names = ['appId']
The zip operation will only iterate over min(len(col_names), len(col_values)) items. Since col_values always has length 1 but col_names can have different lengths, this could skip columns or fail silently.
Suggested change:
-col_names = ['appId']
+col_names = ['appId']
+# Ensure col_values matches col_names in length
+if len(col_values) == 1 and len(col_names) > 1:
+    col_values = col_values * len(col_names)
+elif len(col_values) != len(col_names):
+    raise ValueError("Length of col_values must be 1 or match length of col_names")
if not q_sum_res.success:
    # That should not happen unless there is something terribly wrong because the caller
    # has checked that the result is not empty. Therefore, the summary must exist
    raise q_sum_res.load_error
The error message should be more descriptive. Consider providing context about what failed to load (qualification summary) and the original error details.
Suggested change:
-raise q_sum_res.load_error
+raise RuntimeError("Failed to load qualification summary") from q_sum_res.load_error
    .build()  # combine the results
)
if not q_execs_res.success:
    raise q_execs_res.load_error
The error message should be more descriptive. Consider providing context about what failed to load (execution CSV report) and the original error details.
Suggested change:
-raise q_execs_res.load_error
+raise RuntimeError(
+    f"Failed to load execution CSV report: {q_execs_res.load_error}"
+) from q_execs_res.load_error
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
:return: the resulting dataframe from adding the columns.
"""
# TODO: We should consider add UUID as well, and use that for the joins instead.
# append attempt_id to support multiple attempts
col_values = [self.app_id]
This feels weird: `col_values` is a fixed/default list of one value, while `col_names` can be a user-provided list with multiple values. Also, I'm not sure how I would patch `app_name` per the docstring. And what would happen if I passed in `col_names=['one', 'two', 'three']`? Seems like I'd get three columns with the same (appId) value?
Good point!
I believe that the dev intending to stitch DFs using Apps (default combiner) has to use a column name that can be mapped to the object. Otherwise, he has to use a custom combiner. For example:
- [`appId`, `app_id`, `App ID`, `App Id`, ...] -> all those variations can be reduced to the `app._app_id` field. Otherwise, there won't be a way to tell which col-name will get the value of `app._app_id`.
- [`app_name`, `appName`, `App Name`, ...] -> similarly, all can be mapped to the `app._app_name` field.
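A minimal sketch of the normalization idea described above; the alias tables and the `resolve_patch_value` helper are hypothetical names, and the real mapping (e.g. onto `app._app_id` / `app._app_name`) may differ:

```python
# hypothetical alias tables mapping column-name variants to app fields
APP_ID_ALIASES = {'appId', 'app_id', 'App ID', 'App Id'}
APP_NAME_ALIASES = {'app_name', 'appName', 'App Name'}

def resolve_patch_value(app, col_name: str):
    """Map a requested column name to the app field that should fill it."""
    if col_name in APP_ID_ALIASES:
        return app.app_id
    if col_name in APP_NAME_ALIASES:
        return app.app_name
    # no known mapping: the caller would need a custom combiner instead
    raise ValueError(f'No app field is known to back column {col_name!r}')
```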
    # patch the app_uuid and if the columns are defined.
    return app_entry.patch_into_df(df, col_names=col_names)
return df
Should probably add a `default_failed_app_processor` which just logs something like "failed to combine: {app_id}". This would provide a useful example (and indirect documentation) for the callback API, e.g. the `str` argument will be the `appId`.
Good suggestion.
We can also consider dumping them in a single log-message in order to improve readability of the logs, right?
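A rough sketch of both ideas, a default callback plus batching failures into a single log line; `default_failed_app_processor`, `log_failed_apps`, and the callback signature are assumptions, not the PR's final API:

```python
import logging

logger = logging.getLogger('CSVCombiner')

def default_failed_app_processor(app_id: str, app_res) -> None:
    """Hypothetical default callback: record that an app could not be combined."""
    logger.warning('failed to combine: %s', app_id)

def log_failed_apps(failed_app_ids: list) -> None:
    """Alternative: dump all failed apps in one log message for readability."""
    if failed_app_ids:
        logger.warning('failed to combine %d apps: %s',
                       len(failed_app_ids), ', '.join(failed_app_ids))
```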
    raise ValueError(f'App entry not found for ID: {app_id}')
if combine_args:
    # check if the col_names are provided to stitch the app_ids
    col_names = combine_args.get('col_names', None)
Wonder if we should rename `col_names` here, since it sounds like a "selection" list, but it's actually a "patch" list.
good point.
if self._combine_args and 'use_cols' in self._combine_args:
    # make sure that we insert the columns to the empty dataframe
    injected_cols = pd.DataFrame(columns=self._combine_args['use_cols'])
    return pd.concat([injected_cols, empty_df], axis=1)
Should this also add columns from 'col_names' as well (or is that a subset of 'use_cols')?
Yes, this was a typo; it should be `col_names`.
However, this reminded me that the empty df should also account for whether the CSVReport had arguments to rename cols.
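A sketch of what the empty-result DataFrame might look like once it accounts for the patched `col_names` and any column renames configured on the CSVReport; the `rename_cols` argument here is an assumption based on the discussion above:

```python
import pandas as pd

def build_empty_result(use_cols, col_names=None, rename_cols=None) -> pd.DataFrame:
    """Hypothetical sketch: an empty DataFrame that still exposes the expected schema."""
    columns = list(use_cols)
    if col_names:
        # include the columns that would have been patched in (e.g. appId)
        columns.extend(c for c in col_names if c not in columns)
    empty_df = pd.DataFrame(columns=columns)
    if rename_cols:
        # honor any renames the CSVReport was configured with
        empty_df = empty_df.rename(columns=rename_cols)
    return empty_df
```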
            cb_fn: Callable[[ToolResultHandlerT, str, pd.DataFrame, dict], pd.DataFrame]) -> 'CSVCombiner':
    """Set the processor for successful applications."""
    self._success_app_processor = cb_fn
    return self
For consistency, I'd settle on either `processor` or `cb_fn` as the arg name for both of these, unless there was a good reason for this distinction.
Fair point.
    self._failed_app_processor(app_id, app_res)
else:
    # default behavior is to skip the app
    continue
If we have a default `failed_app_processor`, then we can simplify this block.
Also, can `failed_app_processor` raise an exception? Might want to document expected behavior.
great attention to details +1
Thanks @leewyang for the feedback.
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
# See the License for the specific language governing permissions and
# limitations under the License.

"""Utility functions for featurizer for QualX"""
I'd rename this, since `featurizers` will be the folder containing pluggable code, e.g. I think most of this code was in `featurizers/default.py` (i.e. the default featurizer for qualx).
Absolutely, this is just a temporary location to keep the shadow implementations in one place. Eventually, the new code will migrate to the original qualX packages.
:param arg: The DataFrame to check.
:param allow_empty: If True, allow the DataFrame to be empty.
:return: True if the DataFrame is valid, False otherwise.
"""
This seems like it should be pulled out as an API that other plugin writers can leverage (if we want this to be the desired usage pattern).
mmm, ok. I was just wondering whether such a simple method needs to live in a common place.
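If it were promoted to a shared helper, it could be as small as the following sketch (the name `is_valid_df` is hypothetical):

```python
import pandas as pd

def is_valid_df(arg, allow_empty: bool = False) -> bool:
    """Return True if arg is a DataFrame and, unless allow_empty, non-empty."""
    if not isinstance(arg, pd.DataFrame):
        return False
    return allow_empty or not arg.empty
```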
# Step-a: combine some of the columns.
# Step-b: return the processed DataFrame with the features.

app_info_df = raw_tbls.reports.get('coreRawApplicationInformationCSV')
Minor nit: it would be nice if we had an enum or something that showed the possible reports, s.t. we can use IDE auto-completion. Right now, I'm not sure where I'd find the definitive list.
It is doable; it might need some extra work. We can certainly do that once things are ready.
There should be an auto-generated markdown that lists all labels and their fields.
For now, there are 4 yaml files that contain those new labels: https://github.com/NVIDIA/spark-rapids-tools/blob/4f9abf038d7d0aa48bf0594f68694e21a38e0a19/core/src/main/resources/configs/reports/coreRawMetricsReport.yaml
When opening in the IDE, the files are copied into the wrapper resources folder `user_tools/src/spark_rapids_pytools/resources/generated_files/core/reports`.
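Once the labels are stable, something like the following enum could back IDE auto-completion; only the two labels that appear in this PR are shown, and the enum itself is a hypothetical sketch rather than an existing class:

```python
from enum import Enum

class CoreRawReportLabel(str, Enum):
    """Hypothetical enum of raw-metric report labels (could be generated from the yaml configs)."""
    APPLICATION_INFORMATION = 'coreRawApplicationInformationCSV'
    SQL_DURATION_AND_EXECUTOR_CPU_TIME_PERCENT = 'coreRawSqlDurationAndExecutorCpuTimePercentCSV'

# usage: raw_tbls.reports.get(CoreRawReportLabel.APPLICATION_INFORMATION.value)
```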
# Allow user-provided 'ds_name' as 'appName'
# TODO: Do we still need the line below?
app_info_df['appName'] = raw_tbls.ds_name
I think this is still needed (given assumptions in the code), but I'd like to eventually just inject a separate `ds_name` column, instead of overwriting the existing `appName`.
I like the idea of injecting a `ds_name` column. The original code confused me because I was not sure why we would overwrite the app_name and lose that information.
Another question: the original code was injecting app_name into most of the feature tables. I assume you want to inject ds_name in all of them now. Do we still need it in all other feature tables too?
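A sketch of the direction being discussed: keep `appName` intact and inject `ds_name` as its own column. The helper name is hypothetical, and whether every feature table needs the column is the open question above.

```python
import pandas as pd

def tag_dataset(app_info_df: pd.DataFrame, ds_name: str) -> pd.DataFrame:
    """Hypothetical helper: keep the original appName and add ds_name alongside it."""
    tagged = app_info_df.copy()
    tagged['ds_name'] = ds_name  # separate column instead of overwriting appName
    return tagged
```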
################################################
# coreRawSqlDurationAndExecutorCpuTimePercentCSV
################################################
# TODO: legacy code overwrite the SQLDuration with the AppDuration. It is not clear why we do so?
# This will raise a keyError if app_meta is not defined.
app_meta = ds_meta['app_meta']
# get the array of resultHandlers from the dataset metadata.
core_res_handlers = ds_meta['resultHandlers']
Still think that we may not want to inject `resultHandlers` into the ds_meta, but instead have something like this: `core_res_handlers = get_result_handlers(ds_meta)`, or something that loads all of the handlers and injects them within this function.
The main reason is that `load_profiles()` is one of the top-level APIs that is often used in notebooks (to load the `profile_df`), so it'd be nice if the caller didn't have to modify/mutate the datasets before invoking this. Obviously if there is a good reason why it has to be done this way, we can just push the user-friendly API up a level, e.g. to `load_datasets()`. Basically, the user shouldn't really need to know about resultHandlers when they just want to load the `profile_df` dataframe.
I see! Let's iterate on this offline to define the best way to do that.
As long as we can send the tools output-dir and which tool it was, then I think we should be fine. By sending the output-dir or list of output-directories, we can then initialize the resultHandlers inside the featurizer.
> Basically, the user shouldn't really need to know about resultHandlers when they just want to load the `profile_df` dataframe.
I understand your point, but given that we are pushing "ToolsAPI" to make it easy for users to consume and process the reports, the user needs to be familiar with it.
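A sketch of the helper shape being proposed; `get_result_handlers`, the `ds_meta` keys, and the injected factory are assumptions for illustration, not the agreed design:

```python
from typing import Any, Callable, Dict, List

def get_result_handlers(ds_meta: Dict[str, Any],
                        handler_factory: Callable[[str], Any]) -> List[Any]:
    """Hypothetical: build result handlers from the tool output dirs recorded in
    ds_meta, so notebook callers of load_profiles() never have to inject
    'resultHandlers' themselves."""
    # assumed metadata shape: a list of tool output directories
    output_dirs = ds_meta.get('toolOutputDirs', [])
    return [handler_factory(out_dir) for out_dir in output_dirs]
```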
predictions_df = predict(platform=model_name, qual=qual_output_dir,
                         output_info=output_info,
                         model=estimation_model_args['customModelFile'],
                         qual_handlers=[q_core_handler])
We might want to eventually push the branching down into the core qualx APIs, just so all invocations to `predict()` can use the switch, but this is fine (and easier) for now.
Thanks @leewyang for the feedback.
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Contributes to #1836