
nn.Embedding to avoid OneHotEncoding all categorical columns #425

Conversation

@ravinkohli (Contributor) commented Mar 31, 2022

This PR replaces the LearnedEntityEmbedding with PyTorch's nn.Embedding, which handles categorical inputs directly instead of requiring them to be one-hot encoded first. This reduces memory usage compared to the old version.

Types of changes

  • New feature (non-breaking change which adds functionality)

Motivation and Context

One-hot encoding can lead to an explosion in memory usage when the number of categories per column is high. Using nn.Embedding for such categorical columns significantly reduces memory usage. Moreover, it is a more robust and simpler implementation of the embedding module. To do this, I have introduced a new pipeline step called ColumnSplitter (I am open to better name suggestions), which has min_values_for_embedding as a hyperparameter.
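To make the memory argument concrete, here is a rough back-of-the-envelope comparison; the row count, cardinality, and embedding dimension are illustrative assumptions, not figures from the PR:

```python
# Illustrative memory comparison for one categorical column with many
# categories; all numbers are assumptions, not measurements from the PR.
n_rows = 1_000_000
n_categories = 10_000
embed_dim = 32          # hypothetical embedding dimension
bytes_per_float = 4
bytes_per_int = 8

# One-hot encoding materialises a dense (n_rows x n_categories) matrix.
one_hot_bytes = n_rows * n_categories * bytes_per_float

# nn.Embedding stores only the integer codes plus a small weight table.
embedding_bytes = (n_rows * bytes_per_int
                   + n_categories * embed_dim * bytes_per_float)

print(f"one-hot:   {one_hot_bytes / 1e9:.1f} GB")    # 40.0 GB
print(f"embedding: {embedding_bytes / 1e9:.3f} GB")  # 0.009 GB
```

The gap grows linearly with the number of categories, which is exactly the high-cardinality case the PR targets.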
It also makes minor changes that optimise some parts of the library. These include:

  1. We were loading the data from the datamanager and preprocessing part of it just to obtain the shape of the data after preprocessing. Loading data from disk is time-consuming, and preprocessing even a small part of it is unnecessary. Now the post-preprocessing shape is passed from the EarlyPreprocessing node, which is more efficient.
  2. Remove self.categories from the tabular feature validator, which according to [memo] High memory consumption and the places of doubts #180 takes a lot of memory. We don't really need to store all the categories anyway; we only need num_categories_per_col.
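The second point can be sketched in isolation: keep per-column category counts instead of the full category lists (the variable names mirror the PR, but the snippet itself is a hypothetical illustration, not the validator's code):

```python
# Hypothetical sketch: instead of keeping every category value per column
# (self.categories), keep only the count of distinct categories.
columns = {
    "color": ["red", "blue", "red", "green"],
    "size": ["S", "M", "S", "S"],
}

# Old approach: store full category lists (memory grows with cardinality).
categories = {name: sorted(set(vals)) for name, vals in columns.items()}

# New approach: only the number of categories per column is needed.
num_categories_per_col = [len(set(vals)) for vals in columns.values()]
print(num_categories_per_col)  # [3, 2]
```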

How has this been tested?

I have successfully run example_tabular_classification on the Australian dataset, where the default configuration allows us to verify the features introduced in this PR.

# allows us to pass embed_columns to the dataset properties.
# TODO: test the trade off
# Another solution is to combine `OneHotEncoding`, `Embedding` and `NoEncoding` in one custom transformer.
# this will also allow users to use this transformer outside the pipeline
@ravinkohli (Contributor, Author) commented Mar 31, 2022

Suggested change
# this will also allow users to use this transformer outside the pipeline
# this will also allow users to use this transformer outside the pipeline, see [this](https://github.com/manujosephv/pytorch_tabular/blob/main/pytorch_tabular/categorical_encoders.py#L132)
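The combined-transformer idea mentioned in the comments can be sketched as a per-column routing decision based on a cardinality threshold. The hyperparameter name min_values_for_embedding follows the PR; the helper function itself is a hypothetical illustration:

```python
def split_columns(num_categories_per_col, min_values_for_embedding=5):
    """Hypothetical helper: decide per column whether to embed or
    one-hot encode, based on the number of distinct categories."""
    embed_columns, encode_columns = [], []
    for idx, n_cat in enumerate(num_categories_per_col):
        if n_cat >= min_values_for_embedding:
            embed_columns.append(idx)   # high cardinality -> nn.Embedding
        else:
            encode_columns.append(idx)  # low cardinality -> one-hot
    return embed_columns, encode_columns

print(split_columns([2, 100, 3, 7]))  # ([1, 3], [0, 2])
```

Wrapping this routing plus the corresponding encoders in one custom transformer is what would make it usable outside the pipeline.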

@ravinkohli ravinkohli added the enhancement New feature or request label Mar 31, 2022
@ravinkohli ravinkohli linked an issue Apr 5, 2022 that may be closed by this pull request
@theodorju (Collaborator) left a comment

As discussed in the meeting, I reviewed the changes.

@@ -111,7 +111,7 @@ def send_warnings_to_log(
return prediction


def get_search_updates(categorical_indicator: List[bool]):
def get_search_updates(categorical_indicator: List[bool]) -> HyperparameterSearchSpaceUpdates:
Collaborator:

The method argument is not used, I believe it could be removed.

@@ -267,7 +267,8 @@ def __init__(

self.input_validator: Optional[BaseInputValidator] = None

self.search_space_updates = search_space_updates if search_space_updates is not None else get_search_updates(categorical_indicator)
# if search_space_updates is not None else get_search_updates(categorical_indicator)
Collaborator:
I think this could also be removed.

self.logger.debug(f"run_summary_dict {json.dumps(run_summary_dict)}")
with open(os.path.join(self.backend.temporary_directory, 'run_summary.txt'), 'a') as file:
file.write(f"{json.dumps(run_summary_dict)}\n")
# self._write_run_summary(pipeline)
Collaborator:
Based on the functionality that was encapsulated in the function, I think this should be called here, right?

@@ -297,7 +296,7 @@ def _get_hyperparameter_search_space(self,
"""
raise NotImplementedError()

def _add_forbidden_conditions(self, cs):
def _add_forbidden_conditions(self, cs: ConfigurationSpace) -> ConfigurationSpace:
"""
Add forbidden conditions to ensure valid configurations.
Currently, Learned Entity Embedding is only valid when encoder is one hot encoder
Collaborator:

Based on the changes introduced in the PR, I think the first condition mentioned in the docstring regarding the Learned Entity Embedding should be removed.

return self

def transform(self, X: Dict[str, Any]) -> Dict[str, Any]:
if self.num_categories_per_col is not None:
Collaborator:

self.num_categories_per_col is initialized as an empty list, which means it will not be None even when there are only encoded columns. Maybe this condition should be changed to:

if self.num_categories_per_col:
    ...

@ravinkohli (Contributor, Author):

It will be None when there are no categorical columns, see line 38.

@theodorju (Collaborator) commented Jul 18, 2022

Hm, but line 38 initializes self.num_categories_per_col to an empty list if there are categorical columns, and [] is not None returns True.

I'm mentioning this because I thought line 53 checks whether there are columns to be embedded; currently the if condition evaluates to True for both embedded and encoded columns.
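The None-versus-empty-list distinction at the heart of this thread can be reproduced in isolation (a standalone snippet, not the pipeline's actual code):

```python
# An empty list is not None, but it is falsy; `is not None` and plain
# truthiness therefore select different branches for [].
num_categories_per_col = []                # no columns to embed

print(num_categories_per_col is not None)  # True  -> `is not None` branch taken
print(bool(num_categories_per_col))        # False -> truthiness branch skipped

# The suggested check only fires when there is something to embed:
if num_categories_per_col:
    raise AssertionError("unreachable for an empty list")
```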

Comment on lines 123 to 124
# has_cat_features = any(categorical_indicator)
# has_numerical_features = not all(categorical_indicator)
Collaborator:
I think this should be removed.

@@ -19,69 +19,59 @@
class _LearnedEntityEmbedding(nn.Module):
""" Learned entity embedding module for categorical features"""

def __init__(self, config: Dict[str, Any], num_input_features: np.ndarray, num_numerical_features: int):
def __init__(self, config: Dict[str, Any], num_categories_per_col: np.ndarray, num_features_excl_embed: int):
"""
Args:
config (Dict[str, Any]): The configuration sampled by the hyperparameter optimizer
num_input_features (np.ndarray): column wise information of number of output columns after transformation
Collaborator:

I think num_input_features should be replaced with `num_categories_per_col (np.ndarray): number of categories for categorical columns that will be embedded`.

@@ -289,6 +277,7 @@ def _get_pipeline_steps(
("imputer", SimpleImputer(random_state=self.random_state)),
# ("variance_threshold", VarianceThreshold(random_state=self.random_state)),
# ("coalescer", CoalescerChoice(default_dataset_properties, random_state=self.random_state)),
("column_splitter", ColumnSplitter(random_state=self.random_state)),
Collaborator:
I think the docstring of the class should be updated to also include column_splitter as a step.

("coalescer", CoalescerChoice(default_dataset_properties, random_state=self.random_state)),
# ("variance_threshold", VarianceThreshold(random_state=self.random_state)),
# ("coalescer", CoalescerChoice(default_dataset_properties, random_state=self.random_state)),
("column_splitter", ColumnSplitter(random_state=self.random_state)),
Collaborator:
Same as tabular_classification.py, it would be nice to add this step in the docstring as well.

ravinkohli and others added 2 commits July 16, 2022 17:31
…edding) (#437)

* add updates for apt1.0+reg_cocktails

* debug loggers for checking data and network memory usage

* add support for pandas, test for data passing, remove debug loggers

* remove unwanted changes

* :

* Adjust formula to account for embedding columns

* Apply suggestions from code review

Co-authored-by: nabenabe0928 <[email protected]>

* remove unwanted additions

* Update autoPyTorch/pipeline/components/preprocessing/tabular_preprocessing/TabularColumnTransformer.py

Co-authored-by: nabenabe0928 <[email protected]>
* reduce number of hyperparameters for pytorch embedding

* remove todos for the preprocessing PR, and apply suggestion from code review

* remove unwanted exclude in test
@ravinkohli (Contributor, Author):

This branch will be merged into reg_cocktails. Therefore, this PR has been moved to #451.

@ravinkohli ravinkohli closed this Aug 9, 2022
Labels
enhancement New feature or request
Successfully merging this pull request may close these issues.

Replace Embedding to use nn.Embedding from pytorch