nn.Embedding to avoid OneHotEncoding all categorical columns #425
Conversation
# allows us to pass embed_columns to the dataset properties.
# TODO: test the trade off
# Another solution is to combine `OneHotEncoding`, `Embedding` and `NoEncoding` in one custom transformer.
# this will also allow users to use this transformer outside the pipeline
Suggested change:
# this will also allow users to use this transformer outside the pipeline
# this will also allow users to use this transformer outside the pipeline, see [this](https://github.com/manujosephv/pytorch_tabular/blob/main/pytorch_tabular/categorical_encoders.py#L132)
As discussed in the meeting, I reviewed the changes.
autoPyTorch/api/base_task.py
Outdated
@@ -111,7 +111,7 @@ def send_warnings_to_log(
        return prediction


def get_search_updates(categorical_indicator: List[bool]):
def get_search_updates(categorical_indicator: List[bool]) -> HyperparameterSearchSpaceUpdates:
The method argument is not used; I believe it could be removed.
autoPyTorch/api/base_task.py
Outdated
@@ -267,7 +267,8 @@ def __init__(

        self.input_validator: Optional[BaseInputValidator] = None

        self.search_space_updates = search_space_updates if search_space_updates is not None else get_search_updates(categorical_indicator)
        # if search_space_updates is not None else get_search_updates(categorical_indicator)
I think this could also be removed.
self.logger.debug(f"run_summary_dict {json.dumps(run_summary_dict)}")
with open(os.path.join(self.backend.temporary_directory, 'run_summary.txt'), 'a') as file:
    file.write(f"{json.dumps(run_summary_dict)}\n")
# self._write_run_summary(pipeline)
Based on the functionality that was encapsulated in the function, I think this should be called here, right?
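A minimal sketch of what the encapsulated helper could look like (the name and signature are assumptions inferred from the commented-out `self._write_run_summary(pipeline)` call and the inline code above, not autoPyTorch's actual implementation):

```python
import json
import os
from typing import Any, Dict


def write_run_summary(temporary_directory: str, run_summary_dict: Dict[str, Any]) -> None:
    """Append one JSON line per run to run_summary.txt.

    Hypothetical free-function version of the helper discussed above;
    in the PR it would be a method with access to self.backend.
    """
    path = os.path.join(temporary_directory, "run_summary.txt")
    with open(path, "a") as file:
        file.write(f"{json.dumps(run_summary_dict)}\n")
```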
@@ -297,7 +296,7 @@ def _get_hyperparameter_search_space(self,
        """
        raise NotImplementedError()

    def _add_forbidden_conditions(self, cs):
    def _add_forbidden_conditions(self, cs: ConfigurationSpace) -> ConfigurationSpace:
        """
        Add forbidden conditions to ensure valid configurations.
        Currently, Learned Entity Embedding is only valid when encoder is one hot encoder
Based on the changes introduced in the PR, I think the first condition mentioned in the docstring regarding Learned Entity Embedding should be removed.
        return self

    def transform(self, X: Dict[str, Any]) -> Dict[str, Any]:
        if self.num_categories_per_col is not None:
`self.num_categories_per_col` is initialized as an empty list, which means that it will not be None for the encoded columns either. Maybe this condition should be changed to:

    if self.num_categories_per_col:
        ...
It will be None when there are no categorical columns, see line 38.
Hm, but line 38 initializes `self.num_categories_per_col` to an empty list if there are categorical columns, and `[] is not None` returns `True`. I'm mentioning this because I thought line 53 checks whether there are columns to be embedded; currently the if condition evaluates to true both for embedded and encoded columns.
autoPyTorch/api/base_task.py
Outdated
# has_cat_features = any(categorical_indicator)
# has_numerical_features = not all(categorical_indicator)
I think this should be removed.
@@ -19,69 +19,59 @@
class _LearnedEntityEmbedding(nn.Module):
    """ Learned entity embedding module for categorical features"""

    def __init__(self, config: Dict[str, Any], num_input_features: np.ndarray, num_numerical_features: int):
    def __init__(self, config: Dict[str, Any], num_categories_per_col: np.ndarray, num_features_excl_embed: int):
        """
        Args:
            config (Dict[str, Any]): The configuration sampled by the hyperparameter optimizer
            num_input_features (np.ndarray): column wise information of number of output columns after transformation
I think `num_input_features` should be replaced with `num_categories_per_col (np.ndarray): number of categories for categorical columns that will be embedded`.
@@ -289,6 +277,7 @@ def _get_pipeline_steps(
            ("imputer", SimpleImputer(random_state=self.random_state)),
            # ("variance_threshold", VarianceThreshold(random_state=self.random_state)),
            # ("coalescer", CoalescerChoice(default_dataset_properties, random_state=self.random_state)),
            ("column_splitter", ColumnSplitter(random_state=self.random_state)),
I think the docstring of the class should be updated to also include `column_splitter` as a step.
            ("coalescer", CoalescerChoice(default_dataset_properties, random_state=self.random_state)),
            # ("variance_threshold", VarianceThreshold(random_state=self.random_state)),
            # ("coalescer", CoalescerChoice(default_dataset_properties, random_state=self.random_state)),
            ("column_splitter", ColumnSplitter(random_state=self.random_state)),
Same as `tabular_classification.py`, it would be nice to add this step in the docstring as well.
…edding) (#437)

* add updates for apt1.0+reg_cocktails
* debug loggers for checking data and network memory usage
* add support for pandas, test for data passing, remove debug loggers
* remove unwanted changes
* :
* Adjust formula to account for embedding columns
* Apply suggestions from code review

Co-authored-by: nabenabe0928 <[email protected]>

* remove unwanted additions
* Update autoPyTorch/pipeline/components/preprocessing/tabular_preprocessing/TabularColumnTransformer.py

Co-authored-by: nabenabe0928 <[email protected]>

* reduce number of hyperparameters for pytorch embedding
* remove todos for the preprocessing PR, and apply suggestion from code review
* remove unwanted exclude in test
This branch will be merged into reg_cocktails. Therefore, this PR has been shifted to #451.
This PR replaces the `LearnedEntityEmbedding` with PyTorch's `nn.Embedding`, which implicitly one hot encodes categorical columns. This leads to a reduction in memory usage compared to the old version.

Types of changes
Motivation and Context

One hot encoding can lead to an explosion in memory when the number of categories per column is high. Using `nn.Embedding` for such categorical columns will significantly reduce memory usage. Moreover, it is a more robust and simpler implementation of the embedding module. To do this, I have introduced a new pipeline step called `ColumnSplitter` (I am open to better name suggestions), which has `min_values_for_embedding` as a hyperparameter.

It also makes minor changes which optimise some parts of the library. These include:

* the `EarlyPreprocessing` node, making it more efficient.
* removing `self.categories` from the tabular feature validator, which according to [memo] High memory consumption and the places of doubts #180 takes a lot of memory. We don't really need to store all the categories anyway; we only need `num_categories_per_col`.
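As a rough illustration of the memory argument above (a hypothetical sketch, not the PR's actual module; the cardinalities and embedding size are made up), one `nn.Embedding` per categorical column consumes integer indices and stores a small lookup table, instead of materialising one one-hot column per category:

```python
import torch
import torch.nn as nn

# Made-up cardinalities: one high-cardinality column plus two small ones.
num_categories_per_col = [1000, 12, 3]
embed_dim = 4

# One embedding module per categorical column, sized by its cardinality.
embeddings = nn.ModuleList(
    nn.Embedding(num_embeddings=k, embedding_dim=embed_dim)
    for k in num_categories_per_col
)

# A batch of two rows of integer-encoded categories, shape (batch, n_cols).
x = torch.tensor([[999, 0, 2], [5, 11, 1]])

# Embed each column and concatenate: (batch, n_cols * embed_dim).
out = torch.cat([emb(x[:, i]) for i, emb in enumerate(embeddings)], dim=1)
print(out.shape)  # torch.Size([2, 12])
```

One hot encoding the same batch would instead produce 1000 + 12 + 3 = 1015 float columns per row, which is the explosion the PR avoids for high-cardinality columns.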
How has this been tested?

I have successfully run `example_tabular_classification` on the `Australian` dataset, where the default configuration allows us to verify the features introduced in this PR.