AssertionError: Times should be within the range of event times to avoid exterpolation #138

Open
rvandewater opened this issue Nov 22, 2023 · 2 comments

@rvandewater

Hi,

Thank you for creating this package.

I am encountering an error when using my own dataset to create a survival regression model (see below). I am following the Survival Regression with Auton-Survival notebook with the Cox proportional hazards model (the code is below the error). My dataset is preprocessed and extracted from eICU, and the maximum time value is 168 in the train, test, and validation splits.

What I tried: replacing the 168 in the validation set with 167 gives the same error. I also checked the original example, where the maximum time in validation likewise equals the maximum time in training, yet no error is thrown there.

Thank you for your help.

  nonnumeric_cols = [col for (col, dtype) in df.dtypes.iteritems() if dtype.name == "category" or dtype.kind not in "biuf"]

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[44], line 22
     20     # Obtain survival probabilities for validation set and compute the Integrated Brier Score 
     21     predictions_val = model.predict_survival(x_val, times)
---> 22     metric_val = survival_regression_metric('ibs', y_val, predictions_val, times, y_tr)
     23     models.append([metric_val, model])
     25 # Select the best model based on the mean metric value computed for the validation set

File ~/projects/auton-survival/auton_survival/metrics.py:215, in survival_regression_metric(metric, outcomes, predictions, times, outcomes_train, n_bootstrap, random_seed)
    211     outcomes_train = outcomes
    212     warnings.warn("You are are evaluating model performance on the \
    213 same data used to estimate the censoring distribution.")
--> 215   assert max(times) < outcomes_train.time.max(), "Times should \
    216 be within the range of event times to avoid exterpolation."
    217   assert max(times) <= outcomes.time.max(), "Times \
    218 must be within the range of event times."
    220   survival_train = util.Surv.from_dataframe('event', 'time', outcomes_train)

AssertionError: Times should be within the range of event times to avoid exterpolation.
import numpy as np
from auton_survival.estimators import SurvivalModel
from auton_survival.metrics import survival_regression_metric
from sklearn.model_selection import ParameterGrid

# Define parameters for tuning the model
param_grid = {'l2' : [1e-3, 1e-4]}
params = ParameterGrid(param_grid)

# Define the times for model evaluation
times = np.quantile(y_tr['time'][y_tr['event']==1], np.linspace(0.1, 1, 10)).tolist()

# Perform hyperparameter tuning 
models = []
for param in params:
    model = SurvivalModel('cph', random_seed=2, l2=param['l2'])
    
    # The fit method is called to train the model
    model.fit(x_tr, y_tr)

    # Obtain survival probabilities for validation set and compute the Integrated Brier Score 
    predictions_val = model.predict_survival(x_val, times)
    metric_val = survival_regression_metric('ibs', y_val, predictions_val, times, y_tr)
    models.append([metric_val, model])
    
# Select the best model based on the mean metric value computed for the validation set
metric_vals = [i[0] for i in models]
first_min_idx = metric_vals.index(min(metric_vals))
model = models[first_min_idx][1]
@matteo4diani
Contributor

matteo4diani commented Nov 24, 2023

Hi @rvandewater, thanks for contributing to auton-survival 🙂

Given a DeepCoxPH model trained on a survival dataset X_train, Y_train ~ features, (events, times), the minimum and maximum admissible times for computing survival_regression_metric are, as you noted:

min_time = min(Y_train.times.values) + 1
max_time = max(Y_train.times.values) - 1
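To illustrate why the time grid from the issue trips the first assertion in the traceback, here is a minimal sketch on made-up data (the `y_tr` values are hypothetical, not taken from the eICU dataset). It assumes the largest training time is an observed event, in which case the q = 1.0 quantile equals the training maximum and the strict `<` check cannot pass:

```python
import numpy as np
import pandas as pd

# Toy training outcomes (hypothetical data): the largest time is an observed event.
y_tr = pd.DataFrame({
    "time":  [5.0, 20.0, 48.0, 96.0, 120.0, 168.0],
    "event": [1,   1,    0,    1,    1,     1],
})

# Same construction as in the issue: quantiles of the event times up to q = 1.0.
event_times = y_tr["time"][y_tr["event"] == 1]
times = np.quantile(event_times, np.linspace(0.1, 1, 10)).tolist()

# The q = 1.0 quantile is the maximum event time itself ...
largest_requested = max(times)
train_max = y_tr["time"].max()

# ... so the strict inequality asserted in metrics.py does not hold.
passes_strict_check = largest_requested < train_max
print(largest_requested, train_max, passes_strict_check)
```

If the largest training time were censored instead, the maximum event time could fall below the training maximum and the same grid might pass.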

To avoid this problem you have three options:

  1. Apply an upper cut-off of max_time to your times
  2. Drop the last decile(s)
  3. Circumvent the problem and compute a static metric (shouldn't differ much from your average of time-dependent metrics) e.g. with sksurv.metrics.concordance_index_censored:
from sksurv import metrics
from auton_survival import DeepCoxPH
import torch

model = DeepCoxPH()

# ... train model ...

# Use model.torch_model[0] to access the `torch.nn.Module` that computes risk scores for DeepCox
# A better (and retro-compatible) API to access the PyTorch module will be available in the next updates 
with torch.inference_mode():
  model.torch_model[0].eval()
  
  X_test, Y_test = get_test_data()  

  risk_scores = model.torch_model[0](X_test)  

  concordance_index_censored = metrics.concordance_index_censored(
      Y_test.events.values.astype(bool),
      Y_test.times.values,
      risk_scores.squeeze(),
  )
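
Options 1 and 2 can be sketched as follows. This is one possible reading of both suggestions, shown on hypothetical toy data: option 1 is implemented here by filtering out times that are not strictly below the training maximum (rather than clipping them), and option 2 by ending the quantile grid at q = 0.9. The variable names `times_cut` and `times_dropped` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy training outcomes (hypothetical data, for illustration only).
y_tr = pd.DataFrame({
    "time":  [5.0, 20.0, 48.0, 96.0, 120.0, 168.0],
    "event": [1,   1,    0,    1,    1,     1],
})

event_times = y_tr["time"][y_tr["event"] == 1]
train_max = y_tr["time"].max()

# Option 1: keep the original quantile grid, then drop every time
# that is not strictly below the training maximum.
times_full = np.quantile(event_times, np.linspace(0.1, 1, 10))
times_cut = times_full[times_full < train_max].tolist()

# Option 2: stop the quantile grid before q = 1.0 (here 0.1 ... 0.9).
times_dropped = np.quantile(event_times, np.linspace(0.1, 0.9, 9)).tolist()

print(len(times_cut), max(times_cut))
print(len(times_dropped), max(times_dropped))
```

Whichever grid is used, the same `times` list would need to be passed to both `predict_survival` and `survival_regression_metric` so that predictions and evaluation times stay aligned. Option 2 only helps when the last retained quantile is itself below the training maximum, which may not be the case if many events share the maximum time.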

I'm not sure if this answers your question; let me know if you need anything else.

NB: You can enable syntax highlighting for your code by writing "```python" instead of " ```" at the start of the code block.

@rvandewater
Author

Hi @matteo4diani, thanks for your answer. I believe the manual cut-off you suggested was not even needed. Instead, I replaced this line:

times = np.quantile(y_tr['time'][y_tr['event']==1], np.linspace(0.1, 1, 10)).tolist()

With this line:

times = np.quantile(y_val['time'][y_val['event']==1], np.linspace(0.1, 1, 10)).tolist()

With the original line, the quantiles computed from the training data are what the code validates. I am not sure whether this is intended: according to https://autonlab.org/auton-survival/metrics.html, the times should probably be based on the validation or test set and not the training set:

times : np.array
The time points at which to compute metric value(s)
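
A small pre-check can show whether a candidate time grid would pass before the metric is called. The sketch below mirrors the two assertions visible in the traceback (strictly below the training maximum, and at most the evaluation maximum). The helper name `times_within_range` and the toy data are hypothetical and not part of auton-survival:

```python
import numpy as np
import pandas as pd


def times_within_range(times, outcomes_eval, outcomes_train):
    """Return True if `times` satisfies both range checks from the traceback."""
    largest = max(times)
    below_train_max = largest < outcomes_train["time"].max()
    within_eval_max = largest <= outcomes_eval["time"].max()
    return bool(below_train_max and within_eval_max)


# Toy outcomes (hypothetical data): both splits share the maximum time 168,
# but in the validation split that largest time is censored.
y_tr = pd.DataFrame({
    "time":  [5.0, 20.0, 48.0, 96.0, 120.0, 168.0],
    "event": [1,   1,    0,    1,    1,     1],
})
y_val = pd.DataFrame({
    "time":  [8.0, 30.0, 70.0, 150.0, 168.0],
    "event": [1,   1,    1,    1,     0],
})

quantile_levels = np.linspace(0.1, 1, 10)
times_from_train = np.quantile(y_tr["time"][y_tr["event"] == 1], quantile_levels).tolist()
times_from_val = np.quantile(y_val["time"][y_val["event"] == 1], quantile_levels).tolist()

ok_train_grid = times_within_range(times_from_train, y_val, y_tr)
ok_val_grid = times_within_range(times_from_val, y_val, y_tr)
print(ok_train_grid, ok_val_grid)
```

In this toy setup the validation-based grid passes only because the largest validation event time happens to lie below the training maximum; with other data it could fail in the same way as the training-based grid.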
