Data-mining library for processing electronic health records (EHR) to compose a digital patient representation.
To build a general-purpose patient phenotype representation, we are developing the _diagnosenet_dataminig library, which processes electronic health records to produce a binary patient phenotype representation.
The implementation dataset is composed of diagnosis-related groups represented in object form, such as ICD-10 codes, CCAM codes and other codes established by the ATIH agency and generated by the PMSI system for inpatients and outpatients. It covers records of hospital activity in PACA and of PACA residents hospitalised in another region.
The objective of this library is to build a dynamic feature composition, with its respective vocabulary, in order to build a binary patient phenotype representation as a 'document-term sparse matrix' from Electronic Health Records (EHR) data.
The _dIAgnoseNet_DataMinig library uses the following classes:
- diagnosenet_featurescomposition
- diagnosenet_vocabularycomposition
- diagnosenet_cdajson
- diagnosenet_labelcomposition
Main functions:
- ICU_RSA = loadICUData(): loads the files used at the intensive care unit (ICU) PMSI-PACA.
- cda_object = _get_FeaturesSerialized(ICU_RSA): gets the clinical document architecture object (a JSON list) used to process all terms (patient attributes) of each feature group for each patient record.
- binary_PPR = _build_binaryPhenotype(cda_object): creates a binary patient representation using a term-document matrix, according to the feature values selected in a cda_object and the type of vocabulary selected.
Each patient record is serialised in a clinical document architecture (CDA) schema to obtain a 'json_cda object', which is used to build a separate vocabulary for each patient attribute and to derive the binary patient phenotype representation. The CDA schema is also used to standardise the input hospital data in PACA across years and format versions; this first version was created for ICU-2008, or RSA-M24.
The 'class cdaJSON()' defines the clinical document, with a header and a body, in JSON format. The first schema created, for ICU-2008, is called 'cda.SchemaM24()'; it is a function that serialises each record to create a clinical document in JSON format. An example of this CDA JSON schema:
cda_record = {
"x0_header": {"ID_RSA": [], "ID" : [], "hospital": [], "patient": [], "patient_Rol": [], "rsa_V": [], },
"x1_demographics": {"sexe": [], "age_group": [], "activity": [], "postal_code": [], },
"x2_admission_details": {"input_mode": [], "input_source": [], "previous_state": [], "first_week": [], },
"x3_hospitalization_details": {"numdays_hospitalized": [], "sequence_number": [], "surgery_time": [], },
"x4_physical_dependence": {"dressing": [], "displacement": [], "feeding": [], "continence": [], "wheelchair": [], },
"x5_cognitive_dependence": {"comportement": [], "communication": [], },
"x6_rehabilitation_time": {"mechanical_rehab":[], "motorsensory_rehab":[], "neuropsychological_rehab": [],
"cardiorespiratory_rehab": [], "nutritional_rehab": [], "urogenitalsphincter_rehab": [], "kidneys_rehab": [],
"electrical_equipment": [], "collective-rehab": [], "bilans": [], "physiotherapy": [], "balneotherapy": [], },
"x7_associated_diagnosis": {"Das_0": [], ..., "Das_19": [], },
"x8_primary_morbidity": {"care_purpose": [], "morbidity": [], "etiology": [], "major_clinical_category": [], },
"x9_clinical_procedures": {"Pro_0": [], ..., "Pro_k": [], },
"x10_destination": {"last_week": [], "output_mode": [], "destination": [], },
}
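As a rough illustration of how a flat record might be serialised into this schema, the sketch below fills a small subset of the sections from a plain dictionary. The names `serialise_record` and `SECTION_FIELDS`, the input layout, and the example values are assumptions made for illustration; this is not the library's 'cda.SchemaM24()' implementation.

```python
import json

# Hypothetical subset of the schema: section name -> field names.
SECTION_FIELDS = {
    "x0_header": ["ID_RSA", "hospital"],
    "x1_demographics": ["sexe", "age_group", "activity", "postal_code"],
    "x8_primary_morbidity": ["care_purpose", "morbidity", "etiology",
                             "major_clinical_category"],
}

def serialise_record(flat_record):
    """Group the fields of a flat record under their schema section.

    Each value is wrapped in a list, mirroring the empty lists in the
    schema; fields missing from the input stay as empty lists.
    """
    cda_record = {}
    for section, fields in SECTION_FIELDS.items():
        cda_record[section] = {
            field: [flat_record[field]] if field in flat_record else []
            for field in fields
        }
    return cda_record

# Made-up record; the values are illustrative only.
record = {"ID_RSA": "0001", "sexe": "2", "age_group": "30-59", "morbidity": "E10"}
cda_record = serialise_record(record)
cda_json = json.dumps(cda_record, indent=2)  # text that could be written to a .json file
```

Wrapping every value in a list keeps single-valued and multi-valued attributes in the same shape, which simplifies later vocabulary building.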
The cdaJSON class is also used to get the rules specified for grouping the values of 'patient features', which create additional (engineered) features to improve the phenotype representation. As a first example, the following rule was defined:
setAgeGroup(): {age_group: [0-6, 7-12, 13-17, 18-29, 30-59, 60-74] }
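A minimal sketch of such a grouping rule, using the age ranges listed above. The function name `set_age_group` and the behaviour for ages outside the listed ranges are assumptions; the source does not say how the library handles them.

```python
# Age ranges as listed in the rule above (inclusive bounds).
AGE_GROUPS = [(0, 6), (7, 12), (13, 17), (18, 29), (30, 59), (60, 74)]

def set_age_group(age):
    """Return the 'low-high' label of the range containing `age`.

    Ages outside the listed ranges return None (an assumption, since the
    source lists no group for them).
    """
    for low, high in AGE_GROUPS:
        if low <= age <= high:
            return f"{low}-{high}"
    return None

print(set_age_group(56))  # 30-59
print(set_age_group(6))   # 0-6
```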
The clinical document in JSON format, 'clinical_document-2008.json', is written into the first-stage directory:
../healthData/sandbox-(name_EHR_db)/1_Mining-Stage/features_serialization/
The vocabulary composition class builds a medical vocabulary from the PMSI-PACA feature composition groups (features_serializations) and sets up the vocabulary engineering, with a view to creating a general medical vocabulary.
It is the first of three possible vocabularies for indexing the clinical document architecture object list; it creates a fitted term vocabulary from the ICU data in the training dataset.
To set the vocabulary, call the function _dynamic_Vocabulary,
pass it the clinical document JSON object cda_object,
and define the corresponding features for each entity.
def _dynamic_Vocabulary(self, cda_object,
                        x1_name=None, y1_name=None, x3_name=None,
                        x4_name=None, x5_name=None):
    ## Features by electronic health record entity (defaults used when none are given)
    if not x1_name: x1_name = ['age', 'sexe', 'age_group', 'activity']
    if not y1_name: y1_name = ['care_purpose', 'morbidity', 'etiology', 'major_category']
    if not x3_name: x3_name = ['x3_related_diagnoses']
    if not x4_name: x4_name = ['dressing', 'feeding', 'displacement', 'continence']
    if not x5_name: x5_name = ['communication', 'comportement']
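To make the idea of a per-feature vocabulary concrete, here is a minimal sketch that collects the distinct terms observed for each feature across a list of CDA records. The helper `build_vocabulary`, its signature, and the sample records are hypothetical and only illustrate the principle; the actual _dynamic_Vocabulary may work differently.

```python
def build_vocabulary(cda_object, section, feature_names):
    """Map each feature name to the sorted list of distinct terms observed."""
    vocabulary = {}
    for name in feature_names:
        terms = set()
        for record in cda_object:
            # Each field holds a list of values; missing fields contribute nothing.
            terms.update(record.get(section, {}).get(name, []))
        vocabulary[name] = sorted(terms)
    return vocabulary

# Made-up records following the nested section/field layout.
cda_object = [
    {"x1_demographics": {"sexe": ["1"], "age_group": ["30-59"]}},
    {"x1_demographics": {"sexe": ["2"], "age_group": ["30-59"]}},
    {"x1_demographics": {"sexe": ["1"], "age_group": ["60-74"]}},
]
vocab = build_vocabulary(cda_object, "x1_demographics", ["sexe", "age_group"])
print(vocab)  # {'sexe': ['1', '2'], 'age_group': ['30-59', '60-74']}
```

Sorting the terms gives every term a stable column position, so the same vocabulary yields the same binary layout on each run.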
This class builds a binary patient phenotype representation from the selected features. The rows correspond to patient phenotypes (or profiles) and the columns correspond to terms (or features).
This function creates a document-term matrix for each patient feature. There are two kinds of features: the first is a feature represented by a single value, such as {age: 56 || etiology: E10}; the second is a patient feature represented by a list of terms from the same language, such as {related_diagnoses: {DAS1: E212, DAS2: E780, ...}}.
This function creates a binary corpus using a term-document matrix: it gets the feature values from record_object, gets the terms of each feature vocabulary, and derives the binary representation of each feature against its vocabulary.
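The two kinds of features can be encoded as follows. This is a sketch under stated assumptions: the helper names `encode_single` and `encode_list` and the small vocabularies are invented for illustration, and the library's own encoding may differ in detail.

```python
def encode_single(value, vocabulary):
    """One-hot encode a single-valued feature against its vocabulary."""
    return [1 if term == value else 0 for term in vocabulary]

def encode_list(values, vocabulary):
    """Multi-hot encode a list-valued feature: 1 for every term present."""
    present = set(values)
    return [1 if term in present else 0 for term in vocabulary]

# Made-up vocabularies; real ones would come from the training data.
etiology_vocab = ["E10", "E11", "I10"]
diagnosis_vocab = ["E212", "E780", "I10", "J45"]

print(encode_single("E10", etiology_vocab))            # [1, 0, 0]
print(encode_list(["E212", "E780"], diagnosis_vocab))  # [1, 1, 0, 0]
```

In this sketch a value that is absent from the vocabulary produces an all-zero vector.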
This function calls the Vocabulary Composition class to get the vocabulary and uses it to create a binary patient phenotype representation for each patient. The result is written to a text file, to be used in the following stage, "unsupervised learning representation", in the first-stage directory ../healthData/sandbox-pre-trained/1_Mining-Stage/binary_representation/
A small sample of this representation is:
+ ID: 010780492000128701 +
+ x1: [[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1], [0, 0, 1, 0, 0, 0], [1, 0, 0]] +
+ y1: [[0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [1, 0, 0, 0, 0], [0, 0, 0, 1, 0]] +
+ x3: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] +
+ x4: [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0], [1, 0, 0]] +
+ x5: [[1, 0, 0], [1, 0, 0]] +
+ ID: 010780492000129401 +
+ x1: [[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1], [0, 0, 1, 0, 0, 0], [1, 0, 0]] +
+ y1: [[0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [1, 0, 0, 0, 0], [0, 0, 0, 1, 0]] +
+ x3: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] +
+ x4: [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0], [1, 0, 0]] +
+ x5: [[1, 0, 0], [1, 0, 0]] +
+ ID: 010780492000129402 +
+ x1: [[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1], [0, 0, 1, 0, 0, 0], [1, 0, 0]] +
+ y1: [[0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [1, 0, 0, 0, 0], [0, 0, 0, 1, 0]] +
+ x3: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] +
+ x4: [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0], [1, 0, 0]] +
+ x5: [[1, 0, 0], [1, 0, 0]] +
The label composition class gets the medical target from the CDA schema in JSON format to build a one-hot vector representation.
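A minimal sketch of this step, assuming the target is read from one field of the "x8_primary_morbidity" section and that each field holds a one-element list. The function `one_hot_labels`, its parameters, and the sample records are hypothetical and are not the library's API.

```python
def one_hot_labels(cda_object, target="morbidity", section="x8_primary_morbidity"):
    """Return (label_vocabulary, one-hot vectors) for the chosen target field."""
    targets = [record[section][target][0] for record in cda_object]
    label_vocabulary = sorted(set(targets))
    index = {label: i for i, label in enumerate(label_vocabulary)}
    vectors = []
    for label in targets:
        vector = [0] * len(label_vocabulary)
        vector[index[label]] = 1  # exactly one position is set per record
        vectors.append(vector)
    return label_vocabulary, vectors

# Made-up records; the codes are illustrative only.
records = [
    {"x8_primary_morbidity": {"morbidity": ["I10"]}},
    {"x8_primary_morbidity": {"morbidity": ["E10"]}},
    {"x8_primary_morbidity": {"morbidity": ["I10"]}},
]
labels, vectors = one_hot_labels(records)
print(labels)   # ['E10', 'I10']
print(vectors)  # [[0, 1], [1, 0], [0, 1]]
```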