This is the working directory building and validating predictive models for Acute Kidney Injury (AKI) based on PCORNET common data model across GPC sites.
by Xing Song, with Mei Liu, Russ Waitman Medical Informatics Division, Univeristy of Kansas Medical Center
Copyright (c) 2018 Univeristy of Kansas Medical Center
Share and Enjoy according to the terms of the MIT Open Source License.
Acute Kidney Injury (AKI) is a common and highly lethal health problem, affecting 10-15% of all hospitalized patients and >50% of the intensive care unit (ICU) patients. In this application, we propose to build predictive models to identify patients at risk for hospital-acquired AKI and externally validate the models using the PCORnet (Patient Centered Outcomes Research Network)13 common data model (CDM) infrastructure (GPC#711,GPC#742). The project will be carried out with the following aims:
-
Aim 1 - KUMC: Building predictive models on single-site data. We will develop and internally cross-validate machine learning based predictive models for in-hospital AKI using electronic medical record (EMR) data from the University of Kansas Medical Center’s (KUMC) PCORnet CDM. As co-I of the PCORnet network Greater Plains Collaborative (GPC), PI of this project has direct access to the KUMC CDM for model development.
* Task 1.1: developing R-markdown implementation for data extraction and quality check
* Task 1.2: exploratory data analysis (e.g. strategies for data cleaning and representation, feature engineering)
* Task 1.3: benchmarking with replication of current state-of-art prediction model
* Task 1.4: developing new models -
Aim 2 - GPC: Validating predictive models on multi-site data. We will implement an automated analytic package with built in data extraction and predictive modeling from Aim 1 for distributed execution within two PCORnet clinical data research networks (CDRNs), namely GPC led by Dr. Waitman and Veterans Health Administration (VHA) site led by Dr. Matheny in pSCANNER. All prototyping will be done on the KUMC CDM.
* Task 2.1: deploying R-markdown implementation codes for quanlity check, and reporting results to KUMC
* Task 2.2: deploying R-markdown implementation codes or stand-along SQL codes to extract line-item data and transfer to KUMC
--- Alternatively, site can choose to re-use GROUSE data without performing any local extractions.
* Task 2.3 (distributed analytics): deploying R-markdown implementaion codes for external validations of predictive models ( Not fully tested yet )
-
Aim 3 - KUMC: Interpreting the AKI prediction model by analyzing and visualizing top important predictors
* Task 3.1: rank features based on their importance in improving model performance
* Task 3.2: use SHAP value to evaluate marginal effects of top important predictors
* Task 3.3: create dashboard for visualizing feature ranking and marginal plots -
Aim 4 - KUMC: Developing joint KL-divergence and adjMMD metrics to infer model transportability and potentially identify reasons imparing model transportability
* Task 4.1: identify overlapped and site-specific feature space
* Task 4.2: develop metrics, joint KL-divergence and adjMMD, to quantify joint distribution hetergenity, considering missing value patterns
* Task 4.3: invetigate the usability of the new metrics
For each patient in the data set, we collected all variables that are currently populated in the PCORnet CDM schema over both source and target sites, which includes:
- general demographic details (i.e., age, gender and race),
- all structured clinical variables that are currently supported by PCORnet CDM Version 4,
- including diagnoses (ICD-9 and ICD-10 codes), procedures (ICD and CPT codes),
- lab tests (LOINC codes), medications (RXNORM and NDC codes),
- selected vital signs (e.g., blood pressure, height, weight, BMI)30.
All variables are time-stamped and every patient in the dataset was represented by a sequence of clinical events construed by clinical observation vectors aggregated on daily basis.
The initial feature set contained more than 30,000 distinct features. We performed an automated curation process as follows:
- systematically identified extreme values of numerical variables (e.g., lab test results and vital signs) that are beyond 1st and 99th percentile as outliers and removed them;
- performed one-hot-coding on categorical variables (e.g., diagnosis and procedure codes) to convert them into binary representations;
- used the cumulative-exposure-days of medications as predictors instead of a binary indicator for the sheer existence of that medication;
- when repeated measurements presented within certain time interval, we chose the most recent value;
- when measurements are missing for a certain time interval, we performed a common sampling practice called sample-and-hold which carried the earlier available observation over;
- introduced additional features such as lab value changes since last observation or daily blood pressure trends, which have been shown to be predictive of AKI10.
The discrete-time survival model (DTSA) required converting the encounter-level data into an Encounter-Period data set with discrete time interval indicator (i.e. day1, day2, day3,...). More details about this conversion can be found in the format_data()
and get_dsurv_temporal()
functions from /R/util.R
. As shown in the figure below: ,
AKI patient at days of AKI onset contributed to positive outcomes, while earlier non-AKI days of AKI patients as well as daily outcomes of truely non-AKI patients (i.e. who never progressed to any stage of AKI) contributed to nefative outcomes.
In order for sites to extract AKI cohort, run predictive models and generate final report, the following infrastructure requirement must be satisfied:
R program: R Program (>=3.3.0) is required and R studio (>= 1.0.136) is preferred to be installed as well for convenient report generation.
DBMS connection: Valid channel should also be established between R and DBMS so that communication between R and CDM database can be supported.
Dependencies: A list of core R packages as well as their dependencies are required. However, their installations have been included in the codes.
- DBI (>=0.2-5): for communication between R and relational database
- ROracle (>=1.3-1): an Oracle JDBC driver
- RJDBC: a SQL sever driver
- RPostgres: a Postgres driver
- rmarkdown (>=1.10): for rendering report from .Rmd file (Note: installation may trip over dependencies digest and htmltools (>=0.3.5), when manually installation is required).
- dplyr (>=0.7.5): for efficient data manipulation
- tidyr (>=0.8.1): for efficient data manipulation
- magrittr (>=1.5): to enable pipeline operation
- stringr (>=1.3.1): for handling strings
- knitr (>=1.11): help generate reports
- kableExtra: for generating nice tables
- ggplot2 (>=2.2.1): for generating nice plots
- ggrepel: to avoid overlapping labels in plots
- openxlsx (>=4.1.0): to save tables into multiple sheets within a single .xlsx file
- RCurl: for linkable descriptions (when uploading giant mapping tables are not feasible)
- XML: for linkable descriptions (when uploading giant mapping tables are not feasible)
- xgboost: for effectively training the gradient boosting machine
- pROC: for calculating receiver operating curve
- PRROC: for calculating precision recall curve
- ParBayesianOptimization: a parallel implementation for baysian optimizaion for xgboost
- doParallel: provide backend for parallelization
The following instructions are for extracting cohort and generating final report from a DBMS
data source (specified by DBMS_type
) (available options are: Oracle, tSQL, PostgreSQL(not yet))
- Get
AKI_CDM
code
- download the AKI_CDM repository as a .zip file, unzip and save folder as
path-to-dir/AKI_CDM
OR - clone AKI_CDM repository (using git command):
i) navigate to the local directorypath-to-dir
where you want to save the project repository and
type command line:$ cd <path-to-dir>
ii) clone the AKI_CDM repository by typing command line:$ git clone https://github.com/kumc-bmi/AKI_CDM
-
Prepare configeration file
config.csv
and save in the AKI_CDM project folder
i) download theconfig_<DBMS_type>_example.csv
file according toDBMS_type
ii) fill in the content accordingly (or you can manually create the file using the following format)username password access cdm_db_name/sid cdm_db_schema temp_db_schema your_username your_passwd host:port database name(tSQL)/SID(Oracle) current CDM schema default schema iii) save as
config.csv
under the same directory
-
Extract AKI cohort and generate final report
i) setup working directory
- In r-studio environment, simply open R projectAKI_CDM.Rproj
within the folder OR
- In plain r environment, set working directory to whereAKI_CDM
locates by runingsetwd("path-to-dir/AKI_CDM")
ii) edit r script
render_report.R
by specifying the following parameters:
-which_report
: which report you want to render (default is./report/AKI_CDM_EXT_VALID_p1_QA.Rmd
, but there will be more options in the future)
-DBMS_type
: what type of database the current CDM is built on (available options are:Oracle
(default),tSQL
)
-driver_type
: what type of database connection driver is available (available options are:OCI
orJDBC
for oracle;JDBC
for sql server)
-start_date
,end_date
: the start and end date of inclusion period, in"yyyy-mm-dd"
format (e.g. "2010-01-01", with the quotes)iii) run Part I of r script
render_report.R
after assigning correct values to the parameters in ii)iv) collect and report all output files from
/output
folder
-- a. AKI_CDM_EXT_VALID_p1_QA.html - html report with description, figures and partial tables
-- b. AKI_CDM_EXT_VALID_p1_QA_TBL.xlsx - excel with full summary tables
Remark: all the counts (patient, encounter, record) are masked as "<11" if the number is below 11
Part II: Validate Existing Predictive Models and Retrain Predictive Models with Local Data (Not fully tested yet)
-
Validate the given predictive model trained on KUMC's data
i) download the predictive model package, "AKI_model_kumc.zip", from the securefile link shared by KUMC. Unzip the file and save everything under
./data/model_kumc
(remark: make sure to save the files under the correct directory, as they will be called later using the corresponding path)ii) continue to run Part II.0 of the r script
render_report.R
after completing Part I. Part II.0 will only depend on tables already extracted from Part I (saved locally in the folder./data/raw/...
), no parameter needs to be set up.iii) continue to run Part II.1 of the r script
render_report.R
after completing Part II.0. Part II.1 will only depend on tables already extracted from Part II.0 (saved locally in the folder./data/preproc/...
), no parameter needs to be set up.iv) collect and report the two new output files from
/output
folder
-- a. AKI_CDM_EXT_VALID_p2_1_Benchmark.html - html report with description, figures and partial tables
-- b. AKI_CDM_EXT_VALID_p2_1_Benchmark_TBL.xlsx - excel with full summary tables -
Retrain the model using local data and validate on holdout set
i) download the data dictionary, "feature_dict.csv", from the securefile link shared by KUMC and save the file under "./ref/" (remark: make sure to save the file under the correct directory, as it will be called later using the corresponding path)
ii) continue to run (optionally, if already run for Part II.1) Part II.0 of the r script
render_report.R
after completing Part I. Part II.0 will only depend on tables already extracted from Part I (saved locally in the folder./data/raw/...
), no parameter needs to be set up.iii) continue to run Part II.2 of the r script
render_report.R
after completing Part I. Part II.2 will only depend on tables already extracted from Part II.0 (saved locally in the folder./data/preproc/...
), no parameter needs to be set up.iv) collect and report the two new output files from
/output
folder
-- a. AKI_CDM_EXT_VALID_p2_2_Retrain.html - html report with description, figures and partial tables
-- b. AKI_CDM_EXT_VALID_p2_2_Retrain_TBL.xlsx - excel with full summary tables
*Remark: As along as Part I is completed, Part II.1 and Part II.2 can be run independently, based on each site's memory and disk availability.
Run the distribution_analysis.R
script to calculate the adjMMD and joint KL-divergence of distribution hetergenity among top important variables of each model. adjMMD is an effective metric which can be used to assess and explain model transportability.
a. It takes about 2 ~ 3 hours to complete Part I (AKI_CDM_EXT_VALID_p1_QA.Rmd). At peak time, it will use about 30 ~ 40GB memory, especially when large tables like Precribing or Lab tables are loaded in. Total size of output for Part I is about 6MB.
b. It takes about 60 ~ 70 hours (6hr/task) to complete Part II.0 (AKI_CDM_EXT_VALID_p2_0_Preprocess.Rmd). At peak time, it will use about 50 ~ 60GB memory, especially at the preprocessing stage. Total size of intermediate tables and output for Part II.0 is about 2GB.
c. It takes about 25 ~ 30 hours to complete Part II.1 (AKI_CDM_EXT_VALID_p2_Benchmark.Rmd). At peak time, it will use about 30 ~ 40GB memory, especially at the preprocessing stage. Total size of intermediate tables and output for Part II.1 is about 600MB.
d. It takes about 40 ~ 50 hours to complete Part II.2 (AKI_CDM_EXT_VALID_p2_Retrain.Rmd). At peak time, it will use about 30 ~ 40GB memory, especially at the preprocessing stage. Total size of intermediate tables and output for Part II.2 is about 800MB.
We used Shapely Additive exPlanations (SHAP) values to evaluate the marginal effects of the shared top important variables of interests 34. Specifically, the SHAP values evaluated how the logarithmic odds ratio changed by including a factor of certain value for each individual patient. The SHAP values not only captured the global patterns of effects of each factor but also demonstrated the patient-level variations of the effects. We also estimated 95% bootstrapped confidence intervals of SHAP values for each selected feature based on 100 bootstrapped samples. Visit our SHAP value dashboard for more details.
The Maximum Mean Discrepancy (MMD) has been widely used in transfer learning studies for maximizing the similarity among distributions of different domains. Here we modified the classic MMD by taking the missing pattern and feature importance into consideration, which is used to measure the similarities of distributions for the same feature between training and validation sites. Visit our paper Cross-Site Transportability of an Explainable Artificial Intelligence Model for Acute Kidney Injury Prediction (DOI:10.1038/s41467-020-19551-w) for more details.
updated 11/10/2020