From ce19566e702bd9bba7f89898c43a14634e54a0e4 Mon Sep 17 00:00:00 2001 From: Jay Gil <55799578+JayGilDS@users.noreply.github.com> Date: Tue, 9 Sep 2025 20:49:42 -0300 Subject: [PATCH 1/3] Add files via upload Add new demo notebook. It features ruled based models. --- .../rule_based_ner_and_assertion_models.ipynb | 856 ++++++++++++++++++ 1 file changed, 856 insertions(+) create mode 100644 tutorials/streamlit_notebooks/healthcare/rule_based_ner_and_assertion_models.ipynb diff --git a/tutorials/streamlit_notebooks/healthcare/rule_based_ner_and_assertion_models.ipynb b/tutorials/streamlit_notebooks/healthcare/rule_based_ner_and_assertion_models.ipynb new file mode 100644 index 000000000..138de3b2e --- /dev/null +++ b/tutorials/streamlit_notebooks/healthcare/rule_based_ner_and_assertion_models.ipynb @@ -0,0 +1,856 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "eVLAFDcYwzCs" + }, + "source": [ + "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lGu6GyR8uTmE" + }, + "source": [ + "# Rule Based NER and Assertion Models\n", + "This notebook demonstrates the use of rule-based Named Entity Recognition (NER) combined with assertion detection models for structured information extraction from text.\n", + "Here we demo a series of Text Matcher models, each designed to identify and extract entities of interest, such as states, cities, and drugs, using pre-defined dictionaries and linguistic patterns. By applying these targeted matchers, we can ensure high precision in entity identification, especially in specialized contexts where standard models may underperform.\n", + "\n", + "Beyond entity detection, the notebook also integrates Contextual Assertion Models, which determine the status of an entity in context. 
For example, they indicate whether a drug is mentioned as possibly used (Detect Possible Assertion) or conditionally prescribed (Detect Conditional Assertion).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F3y-4CRcw-GS" + }, + "source": [ + "## **🎬 Colab Setup**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ltXTfuG-WZ-B" + }, + "outputs": [], + "source": [ + "# import johnsnowlabs library\n", + "!pip install -q johnsnowlabs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MATjEDqWW3qL" + }, + "outputs": [], + "source": [ + "# Upload license key for healthcare NLP\n", + "from google.colab import files\n", + "print('Please Upload your John Snow Labs License using the button below')\n", + "license_keys = files.upload()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xWpo7EWIy7Zm" + }, + "outputs": [], + "source": [ + "from johnsnowlabs import nlp, medical\n", + "nlp.settings.enforce_versions=True\n", + "nlp.install(refresh_install=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "hATfV8oIW53Q" + }, + "outputs": [], + "source": [ + "# import Spark NLP and Spark NLP for Healthcare from johnsnowlabs library\n", + "from johnsnowlabs import nlp, medical\n", + "\n", + "nlp.install()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uWM6-MHhXPSL" + }, + "outputs": [], + "source": [ + "# import required modules\n", + "from sparknlp.base import *\n", + "from pyspark.ml import Pipeline\n", + "\n", + "spark = nlp.start()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M3mzYaSTyeqY" + }, + "source": [ + "# 🔎 MODELS\n", + "Models used in this pipeline and the entities they extract." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2H9XO0i9ycNQ" + }, + "source": [ + "| Index | Model | Entities |\n", + "|---:|:------------------------|:-|\n", + "| 1 | [country_matcher](https://nlp.johnsnowlabs.com/2024/10/23/country_matcher_en.html) | Country |\n", + "| 2 | [state_matcher](https://nlp.johnsnowlabs.com/2024/09/11/state_matcher_en.html) | State |\n", + "| 3 | [city_matcher](https://nlp.johnsnowlabs.com/2024/07/02/city_matcher_en.html) | City |\n", + "| 4 | [drug_matcher](https://nlp.johnsnowlabs.com/2024/03/19/drug_matcher_en.html) | Drug |\n", + "| 5 | [biomarker_matcher](https://nlp.johnsnowlabs.com/2024/03/06/biomarker_matcher_en.html) | Biomarker |\n", + "| 6 | [cancer_diagnosis_matcher](https://nlp.johnsnowlabs.com/2024/06/17/cancer_diagnosis_matcher_en.html) | Cancer_dx |\n", + "| 7 | [contextual_assertion_conditional](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_conditional_en.html) | Conditional |\n", + "| 8 | [contextual_assertion_possible](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_possible_en.html) | Possible |\n", + "| 9 | [contextual_assertion_someone_else](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_someone_else_en.html) | Someone_else |\n", + "| 10 | [contextual_assertion_absent](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_absent_en.html) | Absent |\n", + "| 11 | [contextual_assertion_past](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_past_en.html) | Past |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZrTceG5Lrwcl" + }, + "source": [ + "# Rule-based Pipeline with Separated Entity Processing\n", + "This pipeline combines rule-based NER models and assertion models" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "zEoZUN40M8lK", + "outputId": "ee979008-e618-4374-fc9f-5e1cfbc39283" + }, + "outputs": [ + { + "output_type": 
"stream", + "name": "stdout", + "text": [ + "sentence_detector_dl_healthcare download started this may take some time.\n", + "Approximate size to download 367.3 KB\n", + "[OK!]\n", + "country_matcher download started this may take some time.\n", + "Approximate size to download 10.2 KB\n", + "[OK!]\n", + "state_matcher download started this may take some time.\n", + "Approximate size to download 6.1 KB\n", + "[OK!]\n", + "city_matcher download started this may take some time.\n", + "Approximate size to download 180.3 KB\n", + "[OK!]\n", + "drug_matcher download started this may take some time.\n", + "Approximate size to download 5.5 MB\n", + "[OK!]\n", + "biomarker_matcher download started this may take some time.\n", + "Approximate size to download 25.6 KB\n", + "[OK!]\n", + "cancer_diagnosis_matcher download started this may take some time.\n", + "Approximate size to download 42.8 KB\n", + "[OK!]\n", + "contextual_assertion_conditional download started this may take some time.\n", + "Approximate size to download 1.3 KB\n", + "[OK!]\n", + "contextual_assertion_possible download started this may take some time.\n", + "Approximate size to download 1.7 KB\n", + "[OK!]\n", + "contextual_assertion_someone_else download started this may take some time.\n", + "Approximate size to download 1.5 KB\n", + "[OK!]\n", + "contextual_assertion_absent download started this may take some time.\n", + "Approximate size to download 1.3 KB\n", + "[OK!]\n", + "contextual_assertion_past download started this may take some time.\n", + "Approximate size to download 1.5 KB\n", + "[OK!]\n" + ] + } + ], + "source": [ + "document_assembler = nlp.DocumentAssembler()\\\n", + " .setInputCol(\"text\")\\\n", + " .setOutputCol(\"document\")\n", + "\n", + "sentence_detector = nlp.SentenceDetectorDLModel.pretrained(\"sentence_detector_dl_healthcare\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"document\"])\\\n", + " .setOutputCol(\"sentence\")\n", + "\n", + "tokenizer = 
nlp.Tokenizer()\\\n", + " .setInputCols([\"sentence\"])\\\n", + " .setOutputCol(\"token\")\n", + "\n", + "country_matcher = medical.TextMatcherModel.pretrained(\"country_matcher\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\"])\\\n", + " .setOutputCol(\"country\")\\\n", + " .setMergeOverlapping(True)\n", + "\n", + "state_matcher = medical.TextMatcherModel.pretrained(\"state_matcher\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\"])\\\n", + " .setOutputCol(\"state\")\\\n", + " .setMergeOverlapping(True)\n", + "\n", + "city_matcher = medical.TextMatcherModel.pretrained(\"city_matcher\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\"])\\\n", + " .setOutputCol(\"city\")\\\n", + " .setMergeOverlapping(True)\n", + "\n", + "drug_matcher = medical.TextMatcherModel.pretrained(\"drug_matcher\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\"])\\\n", + " .setOutputCol(\"drug\")\n", + "\n", + "biomarker_matcher = medical.TextMatcherModel.pretrained(\"biomarker_matcher\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\"])\\\n", + " .setOutputCol(\"biomarker\")\n", + "\n", + "cancer_diagnosis_matcher = medical.TextMatcherModel.pretrained(\"cancer_diagnosis_matcher\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\"])\\\n", + " .setOutputCol(\"cancer_dx\")\\\n", + " .setMergeOverlapping(True)\n", + "\n", + "# Merge all NER entities\n", + "chunk_merger = medical.ChunkMergeApproach()\\\n", + " .setInputCols([\"drug\", \"biomarker\", \"cancer_dx\",\"country\", \"state\", \"city\"])\\\n", + " .setOutputCol(\"ner_chunk\")\\\n", + " .setSelectionStrategy(\"Sequential\")\n", + "\n", + "# Merge clinical entities (for assertions)\n", + "clinical_merger = medical.ChunkMergeApproach()\\\n", + " .setInputCols([\"drug\", \"biomarker\", \"cancer_dx\"])\\\n", + " .setOutputCol(\"clinical_entities\")\\\n", + " 
.setSelectionStrategy(\"DiverseLonger\")\\\n", + " .setOrderingFeatures([\"ChunkLength\"])\n", + "\n", + "# Assertion models (only for clinical entities)\n", + "contextual_assertion_conditional = medical.ContextualAssertion.pretrained(\"contextual_assertion_conditional\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\", \"clinical_entities\"])\\\n", + " .setOutputCol(\"assertion_conditional\")\n", + "\n", + "contextual_assertion_possible = medical.ContextualAssertion.pretrained(\"contextual_assertion_possible\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\", \"clinical_entities\"])\\\n", + " .setOutputCol(\"assertion_possible\")\n", + "\n", + "contextual_assertion_someone_else = medical.ContextualAssertion.pretrained(\"contextual_assertion_someone_else\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\", \"clinical_entities\"])\\\n", + " .setOutputCol(\"assertion_someone_else\")\n", + "\n", + "contextual_assertion_absent = medical.ContextualAssertion.pretrained(\"contextual_assertion_absent\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\", \"clinical_entities\"])\\\n", + " .setOutputCol(\"assertion_absent\")\n", + "\n", + "contextual_assertion_past = medical.ContextualAssertion.pretrained(\"contextual_assertion_past\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\", \"clinical_entities\"])\\\n", + " .setOutputCol(\"assertion_past\")\n", + "\n", + "assertion_merger = medical.AssertionMerger()\\\n", + " .setInputCols([\"assertion_conditional\", \"assertion_possible\", \"assertion_someone_else\", \"assertion_absent\", \"assertion_past\"])\\\n", + " .setOutputCol(\"clinical_assertions\")\\\n", + " .setMergeOverlapping(True)\\\n", + " .setSelectionStrategy(\"sequential\")\\\n", + " .setAssertionSourcePrecedence(\"assertion_conditional, assertion_possible, assertion_someone_else, assertion_absent, 
assertion_past\")\\\n", + " .setCaseSensitive(False)\n", + "\n", + "pipeline = nlp.Pipeline(stages=[\n", + " document_assembler,\n", + " sentence_detector,\n", + " tokenizer,\n", + " country_matcher,\n", + " state_matcher,\n", + " city_matcher,\n", + " drug_matcher,\n", + " biomarker_matcher,\n", + " cancer_diagnosis_matcher,\n", + " chunk_merger,\n", + " clinical_merger,\n", + " contextual_assertion_conditional,\n", + " contextual_assertion_possible,\n", + " contextual_assertion_someone_else,\n", + " contextual_assertion_absent,\n", + " contextual_assertion_past,\n", + " assertion_merger\n", + "])\n" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Fit the Pipeline" + ], + "metadata": { + "id": "NTJoOM-hhGNm" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DF_wapDLOB6j" + }, + "outputs": [], + "source": [ + "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", + "fitted_pipeline = pipeline.fit(empty_data)" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Sample Text to Tryout the Pipeline" + ], + "metadata": { + "id": "ZMS3oL1bhLhz" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "q3jvHLA9lcFZ" + }, + "outputs": [], + "source": [ + "sample_texts = [\n", + " \"\"\"Name: Patel, Rina  Record Date: 2095-11-03  MR: 781093\n", + "Dr. 
Sofia Chen, IP: 172.16.254.12\n", + "She is a 48-year-old female admitted to Unity Health Institute in Toronto\n", + "for thyroidectomy on 11/03/95.\n", + "Patient's VIN: JH4KA8270MC012345, SSN: 333-22-7777, Driver’s License: P987654F\n", + "Phone: +1 (647) 555-1122, Address: 789 Queen Street, Toronto, Canada,\n", + "Email: rina.patel@caremail.org\n", + "In the past 18 months, the patient has traveled to India, Germany, Brazil,\n", + "South Korea, Morocco, and Australia for both business and leisure.\n", + "She reported brief stays in Mexico City and Cairo as well.\n", + "All travel occurred prior to surgery, and she denied any symptoms during or after her trips.\"\"\",\n", + "\n", + " \"\"\"Patient Summary Report\n", + "Name: Green, Thomas L.  DOB: 08/14/2040  Sex: Male  MRN: 559882\n", + "Date of Encounter: 2094-08-30  Facility: St. Margaret’s Medical Center, Atlanta, Georgia\n", + "Physician: Dr. Rebecca Allen, MD – Internal Medicine\n", + "Chief Complaint: Persistent abdominal pain and fatigue for 2 weeks.\n", + "History of Present Illness: Mr. Green is a 54-year-old male who presented\n", + "to the emergency department in Atlanta, GA, with abdominal discomfort\n", + "described as a dull ache localized to the left lower quadrant.\n", + "He reports the pain began during a work trip to Texas and progressively\n", + "worsened while traveling through Nevada and Illinois.\n", + "The patient states he had similar episodes in the past during visits to Florida,\n", + "but those resolved spontaneously. He recently returned from a family reunion\n", + "in New York, where he experienced nausea and loss of appetite.\"\"\",\n", + "\n", + " \"\"\"Name: Laura Martinez  Record Date: 2094-06-15  MR: 927384\n", + "Dr. 
Anthony Kim, IP: 10.0.0.45\n", + "She is a 62-year-old female admitted to Metropolitan Medical Center\n", + "in San Francisco for a knee replacement on 06/15/94.\n", + "Patient's VIN: 1N4AL11D75C678901, SSN: 555-66-9999, Driver's license no: D456321K\n", + "Phone: (415) 555-6723, 1122 Pine Avenue, Chicago, IL, USA,\n", + "E-mail: laura.martinez@healthmail.org\n", + "Patient has traveled to Rome, Dubai, and Cape Town in the past 12 months.\"\"\",\n", + "\n", + " \"\"\"Maria’s physician prescribed clopidogrel for her cardiovascular risk,\n", + "along with ibuprofen for muscle pain, azithromycin for her sinus infection,\n", + "and omeprazole to manage her acid reflux on 2024-07-18.\"\"\",\n", + "\n", + " \"\"\"In the bone marrow (BM) aspirate, blasts comprised 91.3% of nucleated cells,\n", + "expressing CD10, CD19, CD34, CD45, CD117, CD123, HLA-DR, and TdT by flow cytometric analysis.\n", + "Serum tumor marker evaluation revealed elevated levels of carcinoembryonic antigen (CEA: 6.42 ng/mL),\n", + "alpha-fetoprotein (AFP: 11.75 ng/mL), and pro-gastrin-releasing peptide (ProGRP: 85.3 pg/mL).\"\"\",\n", + "\n", + " \"\"\"A 65-year-old woman had a history of debulking surgery, bilateral oophorectomy with omentectomy,\n", + "total anterior hysterectomy with radical pelvic lymph nodes dissection due to ovarian carcinoma\n", + "(mucinous-type carcinoma, stage Ic) 1 year ago. The patient's medical compliance was poor and failed\n", + "to complete her chemotherapy (cyclophosphamide 750 mg/m2, carboplatin 300 mg/m2).\n", + "Recently, she noted a palpable right breast mass, 15 cm in size which nearly occupied the whole right breast\n", + "in 2 months. 
Core needle biopsy revealed metaplastic carcinoma.\n", + "Neoadjuvant chemotherapy with the regimens of Taxotere (75 mg/m2), Epirubicin (75 mg/m2),\n", + "and Cyclophosphamide (500 mg/m2) was given for 6 cycles with poor response,\n", + "followed by a modified radical mastectomy (MRM) with dissection of axillary lymph nodes and skin grafting.\n", + "Postoperatively, radiotherapy was done with 5000 cGy in 25 fractions.\n", + "The histopathologic examination revealed a metaplastic carcinoma with squamous differentiation\n", + "associated with adenomyoepithelioma.\n", + "Immunohistochemistry study showed that the tumor cells are positive for epithelial markers\n", + "(cytokeratin AE1/AE3), and myoepithelial markers, including CK 5/6, p63, and S100.\n", + "Expressions of hormone receptors, including ER, PR, and Her-2/Neu, were all negative.\"\"\",\n", + "\n", + " \"\"\"Patient has a family history of diabetes. Father diagnosed with heart failure last year.\n", + "Sister and brother both have asthma. Grandfather had cancer in his late 70s.\n", + "No known family history of substance abuse. Family history of autoimmune diseases is also noted.\"\"\",\n", + "\n", + " \"\"\"Patient resting in bed. Patient given azithromycin without any difficulty.\n", + "Patient has audible wheezing, states chest tightness.\n", + "No evidence of hypertension. Patient denies nausea at this time. Zofran declined.\n", + "Patient is also having intermittent sweating associated with pneumonia.\"\"\",\n", + "\n", + " \"\"\"The patient presents with symptoms suggestive of pneumonia, including fever, productive cough,\n", + "and mild dyspnea. Chest X-ray findings are compatible with a possible early-stage infection,\n", + "though bacterial pneumonia cannot be entirely excluded.\"\"\",\n", + "\n", + " \"\"\"The patient reports intermittent chest pain when engaging in physical activity,\n", + "particularly on exertion. 
Symptoms appear to be contingent upon increased stress levels and heavy meals.\"\"\",\n", + "\n", + " \"\"\"History of Present Illness: The patient reports a history of influenza with high fever\n", + "(up to 41 °C) approximately two months ago. He now presents again with flu-like symptoms,\n", + "including fever, but denies productive cough.\n", + "Family History: Father with a history of lung cancer.\"\"\"\n", + "]\n" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Apply the Pipeline to Sample Texts" + ], + "metadata": { + "id": "NydYOOBEhYoF" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9o6SeibHWssx" + }, + "outputs": [], + "source": [ + "data = spark.createDataFrame([[text] for text in sample_texts]).toDF(\"text\")\n", + "result = fitted_pipeline.transform(data)" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Print the Results for NER" + ], + "metadata": { + "id": "tvklRLZihkP0" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "laI4cuMZT_5N", + "outputId": "8ecc2e81-83c3-4c73-cef7-8456f5c51f90" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "NER entities\n", + "+-----------------------------+-----+----+---------+\n", + "|result |begin|end |entity |\n", + "+-----------------------------+-----+----+---------+\n", + "|Toronto |155 |161 |City |\n", + "|Toronto |326 |332 |City |\n", + "|Canada |335 |340 |COUNTRY |\n", + "|India |425 |429 |COUNTRY |\n", + "|Germany |432 |438 |COUNTRY |\n", + "|Brazil |441 |446 |COUNTRY |\n", + "|South Korea |449 |459 |COUNTRY |\n", + "|Morocco |462 |468 |COUNTRY |\n", + "|Australia |475 |483 |COUNTRY |\n", + "|Mexico |544 |549 |COUNTRY |\n", + "|Cairo |560 |564 |City |\n", + "|Thomas |36 |41 |DRUG |\n", + "|Atlanta |159 |165 |City |\n", + "|Georgia |168 |174 |COUNTRY |\n", + "|Atlanta |402 |408 |City |\n", + "|Texas |552 |556 |STATE 
|\n", + "|Nevada |609 |614 |STATE |\n", + "|Illinois |620 |627 |STATE |\n", + "|Florida |702 |708 |STATE |\n", + "|reunion |780 |786 |COUNTRY |\n", + "|New York |791 |798 |STATE |\n", + "|Laura |6 |10 |DRUG |\n", + "|San Francisco |160 |172 |City |\n", + "|Chicago |333 |339 |City |\n", + "|USA |346 |348 |COUNTRY |\n", + "|laura |359 |363 |DRUG |\n", + "|Rome |413 |416 |City |\n", + "|Dubai |419 |423 |City |\n", + "|clopidogrel |29 |39 |DRUG |\n", + "|ibuprofen |81 |89 |DRUG |\n", + "|azithromycin |108 |119 |DRUG |\n", + "|omeprazole |150 |159 |DRUG |\n", + "|CD10 |88 |91 |Biomarker|\n", + "|CD19 |94 |97 |Biomarker|\n", + "|CD34 |100 |103 |Biomarker|\n", + "|CD45 |106 |109 |Biomarker|\n", + "|CD117 |112 |116 |Biomarker|\n", + "|CD123 |119 |123 |Biomarker|\n", + "|HLA-DR |126 |131 |Biomarker|\n", + "|TdT |138 |140 |Biomarker|\n", + "|carcinoembryonic antigen |229 |252 |Biomarker|\n", + "|CEA |255 |257 |Biomarker|\n", + "|alpha-fetoprotein |273 |289 |Biomarker|\n", + "|AFP |292 |294 |Biomarker|\n", + "|pro-gastrin-releasing peptide|315 |343 |Biomarker|\n", + "|ProGRP |346 |351 |Biomarker|\n", + "|ovarian carcinoma |175 |191 |Cancer_dx|\n", + "|mucinous-type carcinoma |194 |216 |Cancer_dx|\n", + "|cyclophosphamide |324 |339 |DRUG |\n", + "|carboplatin |352 |362 |DRUG |\n", + "|metaplastic carcinoma |526 |546 |Cancer_dx|\n", + "|Taxotere |595 |602 |DRUG |\n", + "|Epirubicin |616 |625 |DRUG |\n", + "|Cyclophosphamide |643 |658 |DRUG |\n", + "|metaplastic carcinoma |935 |955 |Cancer_dx|\n", + "|adenomyoepithelioma |1003 |1021|Cancer_dx|\n", + "|cytokeratin AE1/AE3 |1116 |1134|Biomarker|\n", + "|myoepithelial markers |1142 |1162|Biomarker|\n", + "|CK 5/6 |1175 |1180|Biomarker|\n", + "|p63 |1183 |1185|Biomarker|\n", + "|S100 |1192 |1195|Biomarker|\n", + "|ER |1242 |1243|Biomarker|\n", + "|PR |1246 |1247|Biomarker|\n", + "|Her-2/Neu |1254 |1262|Biomarker|\n", + "|cancer |142 |147 |Cancer_dx|\n", + "|azithromycin |38 |49 |DRUG |\n", + "|Zofran |194 |199 |DRUG |\n", + "|lung 
cancer |264 |274 |Cancer_dx|\n", + "+-----------------------------+-----+----+---------+\n", + "\n" + ] + } + ], + "source": [ + "# Print results for all NER entities\n", + "print(\"NER entities\")\n", + "result.selectExpr(\"explode(ner_chunk)\").select(\"col.result\", \"col.begin\", \"col.end\", \"col.metadata.entity\").show(100, truncate=False)" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Print Results for Assertions" + ], + "metadata": { + "id": "ZXdXHRvLhu8J" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "07E49VsIUXNZ", + "outputId": "c0f7f892-75cd-4875-85d4-3a05b487fd90" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Assertions\n", + "+-----------------------+-----+----+----------------------------+\n", + "|ner_chunk |begin|end |result |\n", + "+-----------------------+-----+----+----------------------------+\n", + "|ovarian carcinoma |175 |191 |Past |\n", + "|mucinous-type carcinoma|194 |216 |Past |\n", + "|cyclophosphamide |324 |339 |Past |\n", + "|carboplatin |352 |362 |Past |\n", + "|Taxotere |595 |602 |Past |\n", + "|Epirubicin |616 |625 |Past |\n", + "|Cyclophosphamide |643 |658 |Past |\n", + "|metaplastic carcinoma |935 |955 |conditional |\n", + "|cytokeratin AE1/AE3 |1116 |1134|Past |\n", + "|myoepithelial markers |1142 |1162|Past |\n", + "|CK 5/6 |1175 |1180|Past |\n", + "|p63 |1183 |1185|Past |\n", + "|S100 |1192 |1195|Past |\n", + "|ER |1242 |1243|absent |\n", + "|PR |1246 |1247|absent |\n", + "|Her-2/Neu |1254 |1262|absent |\n", + "|cancer |142 |147 |associated_with_someone_else|\n", + "|Zofran |194 |199 |absent |\n", + "|lung cancer |264 |274 |associated_with_someone_else|\n", + "+-----------------------+-----+----+----------------------------+\n", + "\n" + ] + } + ], + "source": [ + "# Assertions (only for clinical entities: Drug, Biomarker, Cancer_dx)\n", + "print(\"Assertions\")\n", + 
"result.selectExpr(\"explode(clinical_assertions)\").select(\"col.metadata.ner_chunk\", \"col.begin\", \"col.end\", \"col.result\").show(100, truncate=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rCPFs8I5mmEy" + }, + "source": [ + "# Entity and Assertion Visualization" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "MYELOqmlgiD9", + "outputId": "2d8b1cc6-dba9-4124-cd6f-025d9d0104c0" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Ner Result Entities \n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "\n", + "\n", + " Name: Patel, Rina  Record Date: 2095-11-03  MR: 781093
Dr. Sofia Chen, IP: 172.16.254.12
She is a 48-year-old female admitted to Unity Health Institute in
Toronto City
for thyroidectomy on 11/03/95.
Patient's VIN: JH4KA8270MC012345, SSN: 333-22-7777, Driver’s License: P987654F
Phone: +1 (647) 555-1122, Address: 789 Queen Street,
Toronto City, Canada COUNTRY,
Email: rina.patel@caremail.org
In the past 18 months, the patient has traveled to
India COUNTRY, Germany COUNTRY, Brazil COUNTRY,
South Korea COUNTRY, Morocco COUNTRY, and Australia COUNTRY for both business and leisure.
She reported brief stays in
Mexico COUNTRY City and Cairo City as well.
All travel occurred prior to surgery, and she denied any symptoms during or after her trips.
" + ] + }, + "metadata": {} + }, + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "\n", + " Clinical Entities\n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "\n", + "\n", + " A 65-year-old woman had a history of debulking surgery, bilateral oophorectomy with omentectomy,
total anterior hysterectomy with radical pelvic lymph nodes dissection due to
ovarian carcinoma Cancer_dx Past
(
mucinous-type carcinoma Cancer_dx Past, stage Ic) 1 year ago. The patient's medical compliance was poor and failed
to complete her chemotherapy (
cyclophosphamide DRUG Past 750 mg/m2, carboplatin DRUG Past 300 mg/m2).
Recently, she noted a palpable right breast mass, 15 cm in size which nearly occupied the whole right breast
in 2 months. Core needle biopsy revealed
metaplastic carcinoma Cancer_dx.
Neoadjuvant chemotherapy with the regimens of
Taxotere DRUG Past (75 mg/m2), Epirubicin DRUG Past (75 mg/m2),
and
Cyclophosphamide DRUG Past (500 mg/m2) was given for 6 cycles with poor response,
followed by a modified radical mastectomy (MRM) with dissection of axillary lymph nodes and skin grafting.
Postoperatively, radiotherapy was done with 5000 cGy in 25 fractions.
The histopathologic examination revealed a
metaplastic carcinoma Cancer_dx conditional with squamous differentiation
associated with
adenomyoepithelioma Cancer_dx.
Immunohistochemistry study showed that the tumor cells are positive for epithelial markers
(
cytokeratin AE1/AE3 Biomarker Past), and myoepithelial markers Biomarker Past, including CK 5/6 Biomarker Past, p63 Biomarker Past, and S100 Biomarker Past.
Expressions of hormone receptors, including
ER Biomarkerabsent , PR Biomarkerabsent , and Her-2/Neu Biomarkerabsent , were all negative." + ] + }, + "metadata": {} + } + ], + "source": [ + "# Visualize NER entities (without assertions)\n", + "print(\"Ner Result Entities \")\n", + "nlp.viz.NerVisualizer().display(result.collect()[0], 'ner_chunk')\n", + "\n", + "# Visualize clinical entities with their assertions\n", + "print(\"\\n\\n Clinical Entities\")\n", + "nlp.viz.AssertionVisualizer().display(result.collect()[5], 'clinical_entities', 'clinical_assertions')\n" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file From 7035fdacf969735da4c79067e98b2862ef102f86 Mon Sep 17 00:00:00 2001 From: "Cabir C." <64752006+Cabir40@users.noreply.github.com> Date: Wed, 10 Sep 2025 16:29:20 +0200 Subject: [PATCH 2/3] renamed --- ..._models.ipynb => RULE_BASE_PIPELINE.ipynb} | 74 +++++++++---------- 1 file changed, 37 insertions(+), 37 deletions(-) rename tutorials/streamlit_notebooks/healthcare/{rule_based_ner_and_assertion_models.ipynb => RULE_BASE_PIPELINE.ipynb} (99%) diff --git a/tutorials/streamlit_notebooks/healthcare/rule_based_ner_and_assertion_models.ipynb b/tutorials/streamlit_notebooks/healthcare/RULE_BASE_PIPELINE.ipynb similarity index 99% rename from tutorials/streamlit_notebooks/healthcare/rule_based_ner_and_assertion_models.ipynb rename to tutorials/streamlit_notebooks/healthcare/RULE_BASE_PIPELINE.ipynb index 138de3b2e..211e4e9a0 100644 --- a/tutorials/streamlit_notebooks/healthcare/rule_based_ner_and_assertion_models.ipynb +++ b/tutorials/streamlit_notebooks/healthcare/RULE_BASE_PIPELINE.ipynb @@ -152,8 +152,8 @@ }, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "sentence_detector_dl_healthcare download started this may take some time.\n", "Approximate 
size to download 367.3 KB\n", @@ -300,12 +300,12 @@ }, { "cell_type": "markdown", - "source": [ - "# Fit the Pipeline" - ], "metadata": { "id": "NTJoOM-hhGNm" - } + }, + "source": [ + "# Fit the Pipeline" + ] }, { "cell_type": "code", @@ -321,12 +321,12 @@ }, { "cell_type": "markdown", - "source": [ - "# Sample Text to Tryout the Pipeline" - ], "metadata": { "id": "ZMS3oL1bhLhz" - } + }, + "source": [ + "# Sample Text to Tryout the Pipeline" + ] }, { "cell_type": "code", @@ -337,7 +337,7 @@ "outputs": [], "source": [ "sample_texts = [\n", - " \"\"\"Name: Patel, Rina  Record Date: 2095-11-03  MR: 781093\n", + " \"\"\"Name: Patel, Rina  Record Date: 2095-11-03  MR: 781093\n", "Dr. Sofia Chen, IP: 172.16.254.12\n", "She is a 48-year-old female admitted to Unity Health Institute in Toronto\n", "for thyroidectomy on 11/03/95.\n", @@ -422,12 +422,12 @@ }, { "cell_type": "markdown", - "source": [ - "# Apply the Pipeline to Sample Texts" - ], "metadata": { "id": "NydYOOBEhYoF" - } + }, + "source": [ + "# Apply the Pipeline to Sample Texts" + ] }, { "cell_type": "code", @@ -443,12 +443,12 @@ }, { "cell_type": "markdown", - "source": [ - "# Print the Results for NER" - ], "metadata": { "id": "tvklRLZihkP0" - } + }, + "source": [ + "# Print the Results for NER" + ] }, { "cell_type": "code", @@ -462,8 +462,8 @@ }, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "NER entities\n", "+-----------------------------+-----+----+---------+\n", @@ -550,12 +550,12 @@ }, { "cell_type": "markdown", - "source": [ - "# Print Results for Assertions" - ], "metadata": { "id": "ZXdXHRvLhu8J" - } + }, + "source": [ + "# Print Results for Assertions" + ] }, { "cell_type": "code", @@ -569,8 +569,8 @@ }, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "Assertions\n", "+-----------------------+-----+----+----------------------------+\n", @@ -628,18 +628,14 @@ }, "outputs": [ { - "output_type": "stream", 
"name": "stdout", + "output_type": "stream", "text": [ "Ner Result Entities \n" ] }, { - "output_type": "display_data", "data": { - "text/plain": [ - "" - ], "text/html": [ "\n", "\n", " Name: Patel, Rina  Record Date: 2095-11-03  MR: 781093
Dr. Sofia Chen, IP: 172.16.254.12
She is a 48-year-old female admitted to Unity Health Institute in
Toronto City
for thyroidectomy on 11/03/95.
Patient's VIN: JH4KA8270MC012345, SSN: 333-22-7777, Driver’s License: P987654F
Phone: +1 (647) 555-1122, Address: 789 Queen Street,
Toronto City, Canada COUNTRY,
Email: rina.patel@caremail.org
In the past 18 months, the patient has traveled to
India COUNTRY, Germany COUNTRY, Brazil COUNTRY,
South Korea COUNTRY, Morocco COUNTRY, and Australia COUNTRY for both business and leisure.
She reported brief stays in
Mexico COUNTRY City and Cairo City as well.
All travel occurred prior to surgery, and she denied any symptoms during or after her trips.
" + ], + "text/plain": [ + "" ] }, - "metadata": {} + "metadata": {}, + "output_type": "display_data" }, { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "\n", "\n", @@ -736,11 +736,7 @@ ] }, { - "output_type": "display_data", "data": { - "text/plain": [ - "" - ], "text/html": [ "\n", "\n", " A 65-year-old woman had a history of debulking surgery, bilateral oophorectomy with omentectomy,
total anterior hysterectomy with radical pelvic lymph nodes dissection due to
ovarian carcinoma Cancer_dxPast
(
mucinous-type carcinoma Cancer_dxPast , stage Ic) 1 year ago. The patient's medical compliance was poor and failed
to complete her chemotherapy (
cyclophosphamide DRUGPast 750 mg/m2, carboplatin DRUGPast 300 mg/m2).
Recently, she noted a palpable right breast mass, 15 cm in size which nearly occupied the whole right breast
in 2 months. Core needle biopsy revealed
metaplastic carcinoma Cancer_dx.
Neoadjuvant chemotherapy with the regimens of
Taxotere DRUGPast (75 mg/m2), Epirubicin DRUGPast (75 mg/m2),
and
Cyclophosphamide DRUGPast (500 mg/m2) was given for 6 cycles with poor response,
followed by a modified radical mastectomy (MRM) with dissection of axillary lymph nodes and skin grafting.
Postoperatively, radiotherapy was done with 5000 cGy in 25 fractions.
The histopathologic examination revealed a
metaplastic carcinoma Cancer_dxconditional with squamous differentiation
associated with
adenomyoepithelioma Cancer_dx.
Immunohistochemistry study showed that the tumor cells are positive for epithelial markers
(
cytokeratin AE1/AE3 BiomarkerPast ), and myoepithelial markers BiomarkerPast , including CK 5/6 BiomarkerPast , p63 BiomarkerPast , and S100 BiomarkerPast .
Expressions of hormone receptors, including
ER Biomarkerabsent , PR Biomarkerabsent , and Her-2/Neu Biomarkerabsent , were all negative." + ], + "text/plain": [ + "" ] }, - "metadata": {} + "metadata": {}, + "output_type": "display_data" } ], "source": [ @@ -853,4 +853,4 @@ }, "nbformat": 4, "nbformat_minor": 0 -} \ No newline at end of file +} From f8fb3898e4968c1f60f0a6411414a2f797f52440 Mon Sep 17 00:00:00 2001 From: Jay Gil <55799578+JayGilDS@users.noreply.github.com> Date: Wed, 10 Sep 2025 21:06:52 -0300 Subject: [PATCH 3/3] Changed NB - New examples --- .../healthcare/RULE_BASED_PIPELINE.ipynb | 936 ++++++++++++++++++ .../healthcare/RULE_BASE_PIPELINE.ipynb | 856 ---------------- 2 files changed, 936 insertions(+), 856 deletions(-) create mode 100644 tutorials/streamlit_notebooks/healthcare/RULE_BASED_PIPELINE.ipynb delete mode 100644 tutorials/streamlit_notebooks/healthcare/RULE_BASE_PIPELINE.ipynb diff --git a/tutorials/streamlit_notebooks/healthcare/RULE_BASED_PIPELINE.ipynb b/tutorials/streamlit_notebooks/healthcare/RULE_BASED_PIPELINE.ipynb new file mode 100644 index 000000000..90caf8b8c --- /dev/null +++ b/tutorials/streamlit_notebooks/healthcare/RULE_BASED_PIPELINE.ipynb @@ -0,0 +1,936 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "eVLAFDcYwzCs" + }, + "source": [ + "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lGu6GyR8uTmE" + }, + "source": [ + "# Rule Based NER and Assertion Models\n", + "This notebook demonstrates the use of rule-based Named Entity Recognition (NER) combined with assertion detection models for structured information extraction from text.\n", + "Here we demo a series of Text Matcher models, each designed to identify and extract entities of interest, such as states, cities, and drugs, using pre-defined dictionaries and linguistic patterns. 
By applying these targeted matchers, we can ensure high precision in entity identification, especially in specialized contexts where standard models may underperform.\n", + "\n", + "Beyond entity detection, the notebook also integrates Contextual Assertion Models, which determine the status of an entity in context. For example, whether a drug is mentioned as being possibly used (Detect Possible Assertion) or conditionally prescribed (Detect Conditional Assertion).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F3y-4CRcw-GS" + }, + "source": [ + "## **🎬 Colab Setup**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ltXTfuG-WZ-B", + "collapsed": true + }, + "outputs": [], + "source": [ + "# import johnsnowlabs library\n", + "!pip install -q johnsnowlabs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MATjEDqWW3qL", + "collapsed": true + }, + "outputs": [], + "source": [ + "# Upload license key for healthcare NLP\n", + "from google.colab import files\n", + "print('Please Upload your John Snow Labs License using the button below')\n", + "license_keys = files.upload()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xWpo7EWIy7Zm" + }, + "outputs": [], + "source": [ + "from johnsnowlabs import nlp, medical\n", + "nlp.settings.enforce_versions=True\n", + "nlp.install(refresh_install=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "hATfV8oIW53Q" + }, + "outputs": [], + "source": [ + "# import Spark NLP and Spark NLP for Healthcare from johnsnowlabs library\n", + "from johnsnowlabs import nlp, medical\n", + "\n", + "nlp.install()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uWM6-MHhXPSL" + }, + "outputs": [], + "source": [ + "# import required modules\n", + "from sparknlp.base import *\n", + "from pyspark.ml import Pipeline\n", + "\n", + "spark = 
nlp.start()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M3mzYaSTyeqY" + }, + "source": [ + "# 🔎 MODELS\n", + "Models used in this pipeline and the entities they extract." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2H9XO0i9ycNQ" + }, + "source": [ + "| Index | Model | Entities |\n", + "|---:|:------------------------|:-|\n", + "| 1 | [country_matcher](https://nlp.johnsnowlabs.com/2024/10/23/country_matcher_en.html) | Country |\n", + "| 2 | [state_matcher](https://nlp.johnsnowlabs.com/2024/09/11/state_matcher_en.html) | State |\n", + "| 3 | [city_matcher](https://nlp.johnsnowlabs.com/2024/07/02/city_matcher_en.html) | City |\n", + "| 4 | [drug_matcher](https://nlp.johnsnowlabs.com/2024/03/19/drug_matcher_en.html) | Drug |\n", + "| 5 | [biomarker_matcher](https://nlp.johnsnowlabs.com/2024/03/06/biomarker_matcher_en.html) | Biomarker |\n", + "| 6 | [cancer_diagnosis_matcher](https://nlp.johnsnowlabs.com/2024/06/17/cancer_diagnosis_matcher_en.html) | Cancer_dx |\n", + "| 7 | [contextual_assertion_conditional](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_conditional_en.html) | Conditional |\n", + "| 8 | [contextual_assertion_possible](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_possible_en.html) | Possible |\n", + "| 9 | [contextual_assertion_someone_else](https://nlp.johnsnowlabs.com/2024/06/26/contextual_assertion_someone_else_en.html) | Someone_else |\n", + "| 10 | [contextual_assertion_absent](https://nlp.johnsnowlabs.com/2024/07/03/contextual_assertion_absent_en.html) | Absent |\n", + "| 11 | [contextual_assertion_past](https://nlp.johnsnowlabs.com/2024/07/04/contextual_assertion_past_en.html) | Past |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZrTceG5Lrwcl" + }, + "source": [ + "# Rule-based Pipeline with Separated Entity Processing\n", + "This pipeline combines rule-based NER models and assertion models" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + 
"metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "zEoZUN40M8lK", + "outputId": "80beef22-7208-4821-f4cf-a3e00f0cdf4b", + "collapsed": true + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "sentence_detector_dl_healthcare download started this may take some time.\n", + "Approximate size to download 367.3 KB\n", + "[OK!]\n", + "country_matcher download started this may take some time.\n", + "Approximate size to download 10.2 KB\n", + "[OK!]\n", + "state_matcher download started this may take some time.\n", + "Approximate size to download 6.1 KB\n", + "[OK!]\n", + "city_matcher download started this may take some time.\n", + "Approximate size to download 180.3 KB\n", + "[OK!]\n", + "drug_matcher download started this may take some time.\n", + "Approximate size to download 5.5 MB\n", + "[OK!]\n", + "biomarker_matcher download started this may take some time.\n", + "Approximate size to download 25.6 KB\n", + "[OK!]\n", + "cancer_diagnosis_matcher download started this may take some time.\n", + "Approximate size to download 42.8 KB\n", + "[OK!]\n", + "contextual_assertion_conditional download started this may take some time.\n", + "Approximate size to download 1.3 KB\n", + "[OK!]\n", + "contextual_assertion_possible download started this may take some time.\n", + "Approximate size to download 1.7 KB\n", + "[OK!]\n", + "contextual_assertion_someone_else download started this may take some time.\n", + "Approximate size to download 1.5 KB\n", + "[OK!]\n", + "contextual_assertion_absent download started this may take some time.\n", + "Approximate size to download 1.3 KB\n", + "[OK!]\n", + "contextual_assertion_past download started this may take some time.\n", + "Approximate size to download 1.5 KB\n", + "[OK!]\n" + ] + } + ], + "source": [ + "document_assembler = nlp.DocumentAssembler()\\\n", + " .setInputCol(\"text\")\\\n", + " .setOutputCol(\"document\")\n", + "\n", + "sentence_detector = 
nlp.SentenceDetectorDLModel.pretrained(\"sentence_detector_dl_healthcare\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"document\"])\\\n", + " .setOutputCol(\"sentence\")\n", + "\n", + "tokenizer = nlp.Tokenizer()\\\n", + " .setInputCols([\"sentence\"])\\\n", + " .setOutputCol(\"token\")\n", + "\n", + "country_matcher = medical.TextMatcherModel.pretrained(\"country_matcher\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\"])\\\n", + " .setOutputCol(\"country\")\\\n", + " .setMergeOverlapping(True)\n", + "\n", + "state_matcher = medical.TextMatcherModel.pretrained(\"state_matcher\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\"])\\\n", + " .setOutputCol(\"state\")\\\n", + " .setMergeOverlapping(True)\n", + "\n", + "city_matcher = medical.TextMatcherModel.pretrained(\"city_matcher\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\"])\\\n", + " .setOutputCol(\"city\")\\\n", + " .setMergeOverlapping(True)\n", + "\n", + "drug_matcher = medical.TextMatcherModel.pretrained(\"drug_matcher\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\"])\\\n", + " .setOutputCol(\"drug\")\n", + "\n", + "biomarker_matcher = medical.TextMatcherModel.pretrained(\"biomarker_matcher\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\"])\\\n", + " .setOutputCol(\"biomarker\")\n", + "\n", + "cancer_diagnosis_matcher = medical.TextMatcherModel.pretrained(\"cancer_diagnosis_matcher\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\"])\\\n", + " .setOutputCol(\"cancer_dx\")\\\n", + " .setMergeOverlapping(True)\n", + "\n", + "# Merge all NER entities\n", + "chunk_merger = medical.ChunkMergeApproach()\\\n", + " .setInputCols([\"drug\", \"biomarker\", \"cancer_dx\",\"country\", \"state\", \"city\"])\\\n", + " .setOutputCol(\"ner_chunk\")\\\n", + " .setSelectionStrategy(\"Sequential\")\n", + "\n", + "# Merge 
clinical entities (for assertions)\n", + "clinical_merger = medical.ChunkMergeApproach()\\\n", + " .setInputCols([\"drug\", \"biomarker\", \"cancer_dx\"])\\\n", + " .setOutputCol(\"clinical_entities\")\\\n", + " .setSelectionStrategy(\"DiverseLonger\")\\\n", + " .setOrderingFeatures([\"ChunkLength\"])\n", + "\n", + "# Assertion models (only for clinical entities)\n", + "contextual_assertion_conditional = medical.ContextualAssertion.pretrained(\"contextual_assertion_conditional\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\", \"clinical_entities\"])\\\n", + " .setOutputCol(\"assertion_conditional\")\n", + "\n", + "contextual_assertion_possible = medical.ContextualAssertion.pretrained(\"contextual_assertion_possible\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\", \"clinical_entities\"])\\\n", + " .setOutputCol(\"assertion_possible\")\n", + "\n", + "contextual_assertion_someone_else = medical.ContextualAssertion.pretrained(\"contextual_assertion_someone_else\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\", \"clinical_entities\"])\\\n", + " .setOutputCol(\"assertion_someone_else\")\n", + "\n", + "contextual_assertion_absent = medical.ContextualAssertion.pretrained(\"contextual_assertion_absent\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\", \"clinical_entities\"])\\\n", + " .setOutputCol(\"assertion_absent\")\n", + "\n", + "contextual_assertion_past = medical.ContextualAssertion.pretrained(\"contextual_assertion_past\", \"en\", \"clinical/models\")\\\n", + " .setInputCols([\"sentence\", \"token\", \"clinical_entities\"])\\\n", + " .setOutputCol(\"assertion_past\")\n", + "\n", + "assertion_merger = medical.AssertionMerger()\\\n", + " .setInputCols([\"assertion_conditional\", \"assertion_possible\", \"assertion_someone_else\", \"assertion_absent\", \"assertion_past\"])\\\n", + " .setOutputCol(\"clinical_assertions\")\\\n", + " 
.setMergeOverlapping(True)\\\n", + " .setSelectionStrategy(\"sequential\")\\\n", + " .setAssertionSourcePrecedence(\"assertion_conditional, assertion_possible, assertion_someone_else, assertion_absent, assertion_past\")\\\n", + " .setCaseSensitive(False)\n", + "\n", + "pipeline = nlp.Pipeline(stages=[\n", + " document_assembler,\n", + " sentence_detector,\n", + " tokenizer,\n", + " country_matcher,\n", + " state_matcher,\n", + " city_matcher,\n", + " drug_matcher,\n", + " biomarker_matcher,\n", + " cancer_diagnosis_matcher,\n", + " chunk_merger,\n", + " clinical_merger,\n", + " contextual_assertion_conditional,\n", + " contextual_assertion_possible,\n", + " contextual_assertion_someone_else,\n", + " contextual_assertion_absent,\n", + " contextual_assertion_past,\n", + " assertion_merger\n", + "])\n" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Fit the Pipeline" + ], + "metadata": { + "id": "NTJoOM-hhGNm" + } + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "DF_wapDLOB6j" + }, + "outputs": [], + "source": [ + "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", + "fitted_pipeline = pipeline.fit(empty_data)" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Sample Clinical Notes to Tryout the Pipeline" + ], + "metadata": { + "id": "ZMS3oL1bhLhz" + } + }, + { + "cell_type": "code", + "execution_count": 81, + "metadata": { + "id": "q3jvHLA9lcFZ" + }, + "outputs": [], + "source": [ + "sample_texts = [\n", + " \"\"\"\n", + "Name: Patel, Rina  Record Date: 1996-11-03  MR: 781093\n", + "\n", + "Dr. 
Sofia Chen\n", + "\n", + "Presentation Summary\n", + "A 48-year-old female admitted to Unity Health Institute in Toronto\n", + "for thyroidectomy.\n", + "\n", + "PMH\n", + "She takes clopidogrel for her cardiovascular risk.\n", + "History of ibuprofen use for muscle pain and history of azithromycin for a sinus infection two months ago.\n", + "Denies remote history of other types of cancer.\n", + "She experiences mild gastric upset when taking ibuprofen, particularly if ingested without food.\n", + "\n", + "HPI\n", + "Ms. Patel is a 48-year-old woman with a diagnosis of papillary thyroid carcinoma.\n", + "Ultrasound showes a 2.6 cm TI-RADS 5 nodule; FNA confirmes malignancy. Pre-op labs revealed suppressed TSH (0.14 mIU/L), elevated thyroglobulin (135 ng/mL), and BRAF V600E mutation positivity.\n", + "She underwent hemithyroidectomy with central neck dissection on 11/03/95. Pathology showed multifocal disease (largest 2.8 cm), capsular invasion, and 3/12 positive lymph nodes.\n", + "Serum thyroglobulin are suggestive of rising during periods of noncompliance with levothyroxine therapy.\n", + "\n", + "Family History\n", + "Sister and brother both have asthma. Grandfather had lung cancer in his late 70s.\n", + "\n", + "Social history\n", + "No smoking, alcohol or drug use history.\n", + "In the past 18 months, the patient has traveled to India, Germany and Brazil for both business and leisure.\n", + "Two months ago she visited her sister in Los Angeles, California.\n", + "She denies any symptoms during or after her trips.\n", + "\n", + "Review of Systems\n", + "General: Denies fever, chills, unintended weight loss, or night sweats.\n", + "Patient resting in bed. Patient given azithromycin without any difficulty.\n", + "Patient denies nausea at this time. zofran declined.\n", + "Cardiovascular: Reports intermittent chest pain on exertion, worse with stress and heavy meals. 
Denies palpitations, syncope, or orthopnea.\n", + "Respiratory: Denies cough, hemoptysis, or shortness of breath at rest.\n", + "\"\"\",\n", + "\n", + "\"\"\"\n", + "Name: Johnson, Maria  Record Date: 2000-04-12  MR: 894562\n", + "\n", + "Dr. Daniel Romero\n", + "\n", + "Presentation Summary\n", + "A 55-year-old female admitted to St. Mary’s Medical Center in Chicago for evaluation and surgical management of colorectal carcinoma.\n", + "\n", + "PMH\n", + "She is currently on atorvastatin for hyperlipidemia.\n", + "History of amoxicillin use for recurrent sinus infections and history of naproxen for joint pain one year ago.\n", + "Denies prior history of breast or thyroid cancer.\n", + "She experiences mild dizziness when taking amoxicillin, particularly if combined with alcohol.\n", + "\n", + "HPI\n", + "Ms. Johnson is a 55-year-old woman with a recent diagnosis of colorectal adenocarcinoma. Colonoscopy revealed a 4.1 cm mass in the colon; biopsy confirmed malignancy. Pre-op biomarkers showed:\n", + "CEA: 18.2 ng/mL (elevated)\n", + "KRAS mutation: Positive\n", + "TSH: 2.1 mIU/L (within normal range)\n", + "CT scan of the abdomen demonstrated possible lymph node involvement. Findings were suggestive of early hepatic metastasis, though not definitive.\n", + "She underwent subtotal colectomy with lymph node dissection on 04/10/96. Pathology shows moderately differentiated adenocarcinoma with 2/15 positive lymph nodes.\n", + "\n", + "Family History\n", + "Sister diagnosed with type 2 diabetes.\n", + "Brother has hypertension.\n", + "Grandfather died of prostate cancer at age 82.\n", + "Other family members reported history of cardiovascular disease.\n", + "\n", + "Social History\n", + "No smoking or recreational drug use. 
Occasional alcohol.\n", + "In the past 2 years, the patient has traveled to Spain, Mexico, and South Korea for work.\n", + "Three months ago she stayed with her brother in Miami, Florida.\n", + "She denies any gastrointestinal symptoms during or after her trips.\n", + "\n", + "Review of Systems\n", + "General: Denies fever, chills, or unintended weight loss. Patient resting comfortably.\n", + "GI: Repots intermittent abdominal pain, worse after large meals. Patient denies current nausea. Ondansetron declined when offered postoperatively.\n", + "Cardiovascular: Reports intermittent palpitations, particularly if under stress. Denies syncope or chest pressure.\n", + "Respiratory: Denies cough, hemoptysis, or shortness of breath.\n", + "Medications/Drug exposures: Currently on atorvastatin. History of amoxicillin and naproxen use. Denies anticoagulant, opioid, or benzodiazepine use.\n", + "\"\"\",\n", + "\n", + "\"\"\"\n", + "Patient ID: MR-552341\n", + "Name: Alvarez, David  Date: 2015-07-22\n", + "Consulting Physician: Dr. Helen Matsuda\n", + "\n", + "Initial Encounter\n", + "A 62-year-old male presented to Mercy Medical Center in Baltimore for evaluation of a persistent cough and unintentional weight loss.\n", + "\n", + "Medical Background\n", + "Current therapy: Metoprolol for hypertension.\n", + "History of prednisone use for bronchitis and history of doxycycline for pneumonia five years ago.\n", + "Denies history of prior malignancy.\n", + "Reports dizziness with metoprolol, particularly if taken on an empty stomach.\n", + "\n", + "Clinical Course\n", + "Mr. Alvarez underwent CT chest, which revealed a 3.9 cm spiculated lesion in the right upper lobe. 
PET scan demonstrated increased uptake in hilar nodes, suggestive of lung cancer metastatic disease.\n", + "Biopsy confirmed non–small cell lung carcinoma (NSCLC).\n", + "Molecular markers included:\n", + "EGFR mutation: Negative\n", + "ALK rearrangement: Positive\n", + "CEA: 12.5 ng/mL (elevated)\n", + "The patient was counseled regarding targeted therapy options. He declined enrollment in a clinical trial but is considering ALK-inhibitor therapy.\n", + "\n", + "Family & Genetic History\n", + "Sister has rheumatoid arthritis.\n", + "Brother has COPD.\n", + "Grandfather died of gastric cancer at age 79.\n", + "Other family members with cardiovascular disease and stroke.\n", + "\n", + "Lifestyle & Exposures\n", + "Denies tobacco or recreational drug use, but reports a 15-year past history of smoking, quit 20 years ago.\n", + "Occasional alcohol intake.\n", + "Recent travel to Japan, Canada, and Argentina for conferences.\n", + "One month ago he went to a concert in Houston, Texas.\n", + "He denies any respiratory symptoms during or after his travels.\n", + "\n", + "System Review\n", + "General: Fatigue and 5 kg weight loss over 3 months. Denies fever or chills.\n", + "Respiratory: Chronic cough with streaks of blood; denies wheezing at rest.\n", + "Cardiac: Reports palpitations, particularly if walking uphill; denies syncope.\n", + "GI: No abdominal pain, but appetite loss. 
Patient denies nausea; ondansetron declined when offered in ED.\n", + "Neurological: Denies headaches or seizures.\n", + "\"\"\"\n", + "]\n" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Apply the Pipeline to Sample Texts" + ], + "metadata": { + "id": "NydYOOBEhYoF" + } + }, + { + "cell_type": "code", + "execution_count": 82, + "metadata": { + "id": "9o6SeibHWssx" + }, + "outputs": [], + "source": [ + "data = spark.createDataFrame([[text] for text in sample_texts]).toDF(\"text\")\n", + "result = fitted_pipeline.transform(data)" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Print the Results for NER" + ], + "metadata": { + "id": "tvklRLZihkP0" + } + }, + { + "cell_type": "code", + "execution_count": 83, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "laI4cuMZT_5N", + "outputId": "0337bf1a-3acc-4575-e17a-efb37ffb9f91" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "NER entities\n", + "+---------------------------+-----+----+---------+\n", + "|result |begin|end |entity |\n", + "+---------------------------+-----+----+---------+\n", + "|Toronto |153 |159 |City |\n", + "|clopidogrel |195 |205 |DRUG |\n", + "|ibuprofen |247 |255 |DRUG |\n", + "|azithromycin |292 |303 |DRUG |\n", + "|cancer |383 |388 |Cancer_dx|\n", + "|ibuprofen |438 |446 |DRUG |\n", + "|papillary thyroid carcinoma|546 |572 |Cancer_dx|\n", + "|malignancy |634 |643 |Cancer_dx|\n", + "|TSH |678 |680 |Biomarker|\n", + "|thyroglobulin |705 |717 |Biomarker|\n", + "|BRAF |736 |739 |Biomarker|\n", + "|Serum thyroglobulin |946 |964 |Biomarker|\n", + "|levothyroxine |1028 |1040|DRUG |\n", + "|lung cancer |1120 |1130|Cancer_dx|\n", + "|India |1257 |1261|COUNTRY |\n", + "|Germany |1264 |1270|COUNTRY |\n", + "|Brazil |1276 |1281|COUNTRY |\n", + "|Los Angeles |1355 |1365|City |\n", + "|California |1368 |1377|STATE |\n", + "|azithromycin |1561 |1572|DRUG |\n", + "|zofran |1634 |1639|DRUG |\n", + "|Chicago |162 |168 |City 
|\n", + "|colorectal carcinoma |212 |231 |Cancer_dx|\n", + "|atorvastatin |259 |270 |DRUG |\n", + "|amoxicillin |303 |313 |DRUG |\n", + "|naproxen |365 |372 |DRUG |\n", + "|thyroid cancer |437 |450 |Cancer_dx|\n", + "|amoxicillin |496 |506 |DRUG |\n", + "|colorectal adenocarcinoma |615 |639 |Cancer_dx|\n", + "|malignancy |708 |717 |Cancer_dx|\n", + "|CEA |746 |748 |Biomarker|\n", + "|KRAS |773 |776 |Biomarker|\n", + "|TSH |797 |799 |Biomarker|\n", + "|adenocarcinoma |1095 |1108|Cancer_dx|\n", + "|prostate cancer |1243 |1257|Cancer_dx|\n", + "|Spain |1457 |1461|COUNTRY |\n", + "|Mexico |1464 |1469|COUNTRY |\n", + "|South Korea |1476 |1486|COUNTRY |\n", + "|Miami |1546 |1550|City |\n", + "|Florida |1553 |1559|STATE |\n", + "|Ondansetron |1832 |1842|DRUG |\n", + "|atorvastatin |2102 |2113|DRUG |\n", + "|amoxicillin |2127 |2137|DRUG |\n", + "|naproxen |2143 |2150|DRUG |\n", + "|Baltimore |177 |185 |City |\n", + "|prednisone |332 |341 |DRUG |\n", + "|doxycycline |377 |387 |DRUG |\n", + "|malignancy |443 |452 |Cancer_dx|\n", + "|lung cancer |718 |728 |Cancer_dx|\n", + "|lung carcinoma |782 |795 |Cancer_dx|\n", + "|NSCLC |798 |802 |Cancer_dx|\n", + "|EGFR |834 |837 |Biomarker|\n", + "|ALK |858 |860 |Biomarker|\n", + "|CEA |886 |888 |Biomarker|\n", + "|gastric cancer |1157 |1170|Cancer_dx|\n", + "|Japan |1418 |1422|COUNTRY |\n", + "|Canada |1425 |1430|COUNTRY |\n", + "|Argentina |1437 |1445|COUNTRY |\n", + "|Houston |1502 |1508|City |\n", + "|Texas |1511 |1515|STATE |\n", + "|ondansetron |1893 |1903|DRUG |\n", + "+---------------------------+-----+----+---------+\n", + "\n" + ] + } + ], + "source": [ + "# Print results for all NER entities\n", + "print(\"NER entities\")\n", + "result.selectExpr(\"explode(ner_chunk)\").select(\"col.result\", \"col.begin\", \"col.end\", \"col.metadata.entity\").show(100, truncate=False)" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Print Results for Assertions" + ], + "metadata": { + "id": "ZXdXHRvLhu8J" + } + }, + { + 
"cell_type": "code", + "execution_count": 85, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "07E49VsIUXNZ", + "outputId": "8f7540bd-8c45-4387-af9a-4078ac6f486c" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Assertions\n", + "+-------------------+-----+----+----------------------------+\n", + "|ner_chunk |begin|end |result |\n", + "+-------------------+-----+----+----------------------------+\n", + "|ibuprofen |247 |255 |Past |\n", + "|azithromycin |292 |303 |Past |\n", + "|cancer |383 |388 |absent |\n", + "|ibuprofen |438 |446 |conditional |\n", + "|Serum thyroglobulin|946 |964 |possible |\n", + "|levothyroxine |1028 |1040|possible |\n", + "|lung cancer |1120 |1130|associated_with_someone_else|\n", + "|zofran |1634 |1639|absent |\n", + "|amoxicillin |303 |313 |Past |\n", + "|naproxen |365 |372 |Past |\n", + "|thyroid cancer |437 |450 |absent |\n", + "|amoxicillin |496 |506 |conditional |\n", + "|prostate cancer |1243 |1257|associated_with_someone_else|\n", + "|Ondansetron |1832 |1842|absent |\n", + "|amoxicillin |2127 |2137|Past |\n", + "|naproxen |2143 |2150|Past |\n", + "|prednisone |332 |341 |Past |\n", + "|doxycycline |377 |387 |Past |\n", + "|malignancy |443 |452 |absent |\n", + "|lung cancer |718 |728 |possible |\n", + "|EGFR |834 |837 |absent |\n", + "|gastric cancer |1157 |1170|associated_with_someone_else|\n", + "|ondansetron |1893 |1903|absent |\n", + "+-------------------+-----+----+----------------------------+\n", + "\n" + ] + } + ], + "source": [ + "# Assertions (only for clinical entities: Drug, Biomarker, Cancer_dx)\n", + "print(\"Assertions\")\n", + "result.selectExpr(\"explode(clinical_assertions)\").select(\"col.metadata.ner_chunk\", \"col.begin\", \"col.end\", \"col.result\").show(100, truncate=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rCPFs8I5mmEy" + }, + "source": [ + "# Entity and Assertion Visualization" + ] + }, + { + "cell_type": 
"markdown", + "source": [ + "## Visualize NER entities (without assertions)" + ], + "metadata": { + "id": "AGQmW6ldOcTB" + } + }, + { + "cell_type": "code", + "execution_count": 86, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "MYELOqmlgiD9", + "outputId": "e8708bc2-c79f-4f23-f514-247adbf190d7" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "All Ner Result Entities (without assertions)\n", + "\n", + "\n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "\n", + "\n", + "
Name: Patel, Rina  Record Date: 1996-11-03  MR: 781093

Dr. Sofia Chen

Presentation Summary
A 48-year-old female admitted to Unity Health Institute in
Toronto City
for thyroidectomy.

PMH
She takes
clopidogrel DRUG for her cardiovascular risk.
History of
ibuprofen DRUG use for muscle pain and history of azithromycin DRUG for a sinus infection two months ago.
Denies remote history of other types of
cancer Cancer_dx.
She experiences mild gastric upset when taking
ibuprofen DRUG, particularly if ingested without food.

HPI
Ms. Patel is a 48-year-old woman with a diagnosis of
papillary thyroid carcinoma Cancer_dx.
Ultrasound showes a 2.6 cm TI-RADS 5 nodule; FNA confirmes
malignancy Cancer_dx. Pre-op labs revealed suppressed TSH Biomarker (0.14 mIU/L), elevated thyroglobulin Biomarker (135 ng/mL), and BRAF Biomarker V600E mutation positivity.
She underwent hemithyroidectomy with central neck dissection on 11/03/95. Pathology showed multifocal disease (largest 2.8 cm), capsular invasion, and 3/12 positive lymph nodes.
Serum thyroglobulin Biomarker are suggestive of rising during periods of noncompliance with levothyroxine DRUG therapy.

Family History
Sister and brother both have asthma. Grandfather had
lung cancer Cancer_dx in his late 70s.

Social history
No smoking, alcohol or drug use history.
In the past 18 months, the patient has traveled to
India COUNTRY, Germany COUNTRY and Brazil COUNTRY for both business and leisure.
Two months ago she visited her sister in
Los Angeles City, California STATE.
She denies any symptoms during or after her trips.

Review of Systems
General: Denies fever, chills, unintended weight loss, or night sweats.
Patient resting in bed. Patient given
azithromycin DRUG without any difficulty.
Patient denies nausea at this time.
zofran DRUG declined.
Cardiovascular: Reports intermittent chest pain on exertion, worse with stress and heavy meals. Denies palpitations, syncope, or orthopnea.
Respiratory: Denies cough, hemoptysis, or shortness of breath at rest.
" + ] + }, + "metadata": {} + } + ], + "source": [ + "print(\"\\nAll Ner Result Entities (without assertions)\\n\\n\")\n", + "# fist text sample\n", + "nlp.viz.NerVisualizer().display(result.collect()[0], 'ner_chunk')" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Visualize Clinical Entities with their Assertions" + ], + "metadata": { + "id": "AS6saBFtOPzI" + } + }, + { + "cell_type": "code", + "source": [ + "print(\"\\nOnly NER Clinical Entities with their assertions\\n\\n\")\n", + "#fist text sample\n", + "nlp.viz.AssertionVisualizer().display(result.collect()[0], 'clinical_entities', 'clinical_assertions')" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "Cz4AC6vROCTn", + "outputId": "f00d75c8-c274-4e76-e5ea-e92b8502bcb9" + }, + "execution_count": 87, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "Only NER Clinical Entities with their assertions\n", + "\n", + "\n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "\n", + "\n", + "
Name: Patel, Rina  Record Date: 1996-11-03  MR: 781093

Dr. Sofia Chen

Presentation Summary
A 48-year-old female admitted to Unity Health Institute in Toronto
for thyroidectomy.

PMH
She takes
clopidogrel DRUG for her cardiovascular risk.
History of
ibuprofen DRUG Past use for muscle pain and history of azithromycin DRUG Past for a sinus infection two months ago.
Denies remote history of other types of
cancer Cancer_dx absent.
She experiences mild gastric upset when taking
ibuprofen DRUG conditional, particularly if ingested without food.

HPI
Ms. Patel is a 48-year-old woman with a diagnosis of
papillary thyroid carcinoma Cancer_dx.
Ultrasound showes a 2.6 cm TI-RADS 5 nodule; FNA confirmes
malignancy Cancer_dx. Pre-op labs revealed suppressed TSH Biomarker (0.14 mIU/L), elevated thyroglobulin Biomarker (135 ng/mL), and BRAF Biomarker V600E mutation positivity.
She underwent hemithyroidectomy with central neck dissection on 11/03/95. Pathology showed multifocal disease (largest 2.8 cm), capsular invasion, and 3/12 positive lymph nodes.
Serum thyroglobulin Biomarker possible are suggestive of rising during periods of noncompliance with levothyroxine DRUG possible therapy.

Family History
Sister and brother both have asthma. Grandfather had
lung cancer Cancer_dx associated_with_someone_else in his late 70s.

Social history
No smoking, alcohol or drug use history.
In the past 18 months, the patient has traveled to India, Germany and Brazil for both business and leisure.
Two months ago she visited her sister in Los Angeles, California.
She denies any symptoms during or after her trips.

Review of Systems
General: Denies fever, chills, unintended weight loss, or night sweats.
Patient resting in bed. Patient given
azithromycin DRUG without any difficulty.
Patient denies nausea at this time.
zofran DRUG absent declined.
Cardiovascular: Reports intermittent chest pain on exertion, worse with stress and heavy meals. Denies palpitations, syncope, or orthopnea.
Respiratory: Denies cough, hemoptysis, or shortness of breath at rest.
" + ] + }, + "metadata": {} + } + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/tutorials/streamlit_notebooks/healthcare/RULE_BASE_PIPELINE.ipynb b/tutorials/streamlit_notebooks/healthcare/RULE_BASE_PIPELINE.ipynb deleted file mode 100644 index 211e4e9a0..000000000 --- a/tutorials/streamlit_notebooks/healthcare/RULE_BASE_PIPELINE.ipynb +++ /dev/null @@ -1,856 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "eVLAFDcYwzCs" - }, - "source": [ - "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "lGu6GyR8uTmE" - }, - "source": [ - "# Rule Based NER and Assertion Models\n", - "This notebook demonstrates the use of rule-based Named Entity Recognition (NER) combined with assertion detection models for structured information extraction from text.\n", - "Here we demo a series of Text Matcher models, each designed to identify and extract entities of interest, such as states, cities, and drugs, using pre-defined dictionaries and linguistic patterns. By applying these targeted matchers, we can ensure high precision in entity identification, especially in specialized contexts where standard models may underperform.\n", - "\n", - "Beyond entity detection, the notebook also integrates Contextual Assertion Models, which determine the status of an entity in context. 
For example, whether a drug is mentioned as being possibly used (Detect Possible Assertion) or conditionally prescribed (Detect Conditional Assertion).\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "F3y-4CRcw-GS" - }, - "source": [ - "## **🎬 Colab Setup**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ltXTfuG-WZ-B" - }, - "outputs": [], - "source": [ - "# import johnsnowlabs library\n", - "!pip install -q johnsnowlabs" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "MATjEDqWW3qL" - }, - "outputs": [], - "source": [ - "# Upload license key for healthcare NLP\n", - "from google.colab import files\n", - "print('Please Upload your John Snow Labs License using the button below')\n", - "license_keys = files.upload()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "xWpo7EWIy7Zm" - }, - "outputs": [], - "source": [ - "from johnsnowlabs import nlp, medical\n", - "nlp.settings.enforce_versions=True\n", - "nlp.install(refresh_install=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "hATfV8oIW53Q" - }, - "outputs": [], - "source": [ - "# import Spark NLP and Spark NLP for Healthcare from johnsnowlabs library\n", - "from johnsnowlabs import nlp, medical\n", - "\n", - "nlp.install()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "uWM6-MHhXPSL" - }, - "outputs": [], - "source": [ - "# import required modules\n", - "from sparknlp.base import *\n", - "from pyspark.ml import Pipeline\n", - "\n", - "spark = nlp.start()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "M3mzYaSTyeqY" - }, - "source": [ - "# 🔎 MODELS\n", - "Models used in this pipeline and the entities they extract." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2H9XO0i9ycNQ" - }, - "source": [ - "| Index | Model | Entities |\n", - "|---:|:------------------------|:-|\n", - "| 1 | [country_matcher](https://nlp.johnsnowlabs.com/2024/10/23/country_matcher_en.html) | Country |\n", - "| 2 | [state_matcher](https://nlp.johnsnowlabs.com/2024/09/11/state_matcher_en.html) | State |\n", - "| 3 | [city_matcher](https://nlp.johnsnowlabs.com/2024/07/02/city_matcher_en.html) | City |\n", - "| 4 | [drug_matcher](https://nlp.johnsnowlabs.com/2024/03/19/drug_matcher_en.html) | Drug |\n", - "| 5 | [biomarker_matcher](https://nlp.johnsnowlabs.com/2024/03/06/biomarker_matcher_en.html) | Biomarker |\n", - "| 6 | [cancer_diagnosis_matcher](https://nlp.johnsnowlabs.com/2024/06/17/cancer_diagnosis_matcher_en.html) | Cancer_dx |\n", - "| 7 | [contextual_assertion_conditional](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_conditional_en.html) | Conditional |\n", - "| 8 | [contextual_assertion_possible](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_possible_en.html) | Possible |\n", - "| 9 | [contextual_assertion_someone_else](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_someone_else_en.html) | Someone_else |\n", - "| 10 | [contextual_assertion_absent](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_absent_en.html) | Absent |\n", - "| 11 | [contextual_assertion_past](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_past_en.html) | Past |" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZrTceG5Lrwcl" - }, - "source": [ - "# Rule-based Pipeline with Separated Entity Processing\n", - "This pipeline combines rule-based NER models and assertion models" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "zEoZUN40M8lK", - "outputId": "ee979008-e618-4374-fc9f-5e1cfbc39283" - }, - "outputs": [ - { - "name": 
"stdout", - "output_type": "stream", - "text": [ - "sentence_detector_dl_healthcare download started this may take some time.\n", - "Approximate size to download 367.3 KB\n", - "[OK!]\n", - "country_matcher download started this may take some time.\n", - "Approximate size to download 10.2 KB\n", - "[OK!]\n", - "state_matcher download started this may take some time.\n", - "Approximate size to download 6.1 KB\n", - "[OK!]\n", - "city_matcher download started this may take some time.\n", - "Approximate size to download 180.3 KB\n", - "[OK!]\n", - "drug_matcher download started this may take some time.\n", - "Approximate size to download 5.5 MB\n", - "[OK!]\n", - "biomarker_matcher download started this may take some time.\n", - "Approximate size to download 25.6 KB\n", - "[OK!]\n", - "cancer_diagnosis_matcher download started this may take some time.\n", - "Approximate size to download 42.8 KB\n", - "[OK!]\n", - "contextual_assertion_conditional download started this may take some time.\n", - "Approximate size to download 1.3 KB\n", - "[OK!]\n", - "contextual_assertion_possible download started this may take some time.\n", - "Approximate size to download 1.7 KB\n", - "[OK!]\n", - "contextual_assertion_someone_else download started this may take some time.\n", - "Approximate size to download 1.5 KB\n", - "[OK!]\n", - "contextual_assertion_absent download started this may take some time.\n", - "Approximate size to download 1.3 KB\n", - "[OK!]\n", - "contextual_assertion_past download started this may take some time.\n", - "Approximate size to download 1.5 KB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "document_assembler = nlp.DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "sentence_detector = nlp.SentenceDetectorDLModel.pretrained(\"sentence_detector_dl_healthcare\", \"en\", \"clinical/models\")\\\n", - " .setInputCols([\"document\"])\\\n", - " .setOutputCol(\"sentence\")\n", - "\n", - "tokenizer = 
nlp.Tokenizer()\\\n", - " .setInputCols([\"sentence\"])\\\n", - " .setOutputCol(\"token\")\n", - "\n", - "country_matcher = medical.TextMatcherModel.pretrained(\"country_matcher\", \"en\", \"clinical/models\")\\\n", - " .setInputCols([\"sentence\", \"token\"])\\\n", - " .setOutputCol(\"country\")\\\n", - " .setMergeOverlapping(True)\n", - "\n", - "state_matcher = medical.TextMatcherModel.pretrained(\"state_matcher\", \"en\", \"clinical/models\")\\\n", - " .setInputCols([\"sentence\", \"token\"])\\\n", - " .setOutputCol(\"state\")\\\n", - " .setMergeOverlapping(True)\n", - "\n", - "city_matcher = medical.TextMatcherModel.pretrained(\"city_matcher\", \"en\", \"clinical/models\")\\\n", - " .setInputCols([\"sentence\", \"token\"])\\\n", - " .setOutputCol(\"city\")\\\n", - " .setMergeOverlapping(True)\n", - "\n", - "drug_matcher = medical.TextMatcherModel.pretrained(\"drug_matcher\", \"en\", \"clinical/models\")\\\n", - " .setInputCols([\"sentence\", \"token\"])\\\n", - " .setOutputCol(\"drug\")\n", - "\n", - "biomarker_matcher = medical.TextMatcherModel.pretrained(\"biomarker_matcher\", \"en\", \"clinical/models\")\\\n", - " .setInputCols([\"sentence\", \"token\"])\\\n", - " .setOutputCol(\"biomarker\")\n", - "\n", - "cancer_diagnosis_matcher = medical.TextMatcherModel.pretrained(\"cancer_diagnosis_matcher\", \"en\", \"clinical/models\")\\\n", - " .setInputCols([\"sentence\", \"token\"])\\\n", - " .setOutputCol(\"cancer_dx\")\\\n", - " .setMergeOverlapping(True)\n", - "\n", - "# Merge all NER entities\n", - "chunk_merger = medical.ChunkMergeApproach()\\\n", - " .setInputCols([\"drug\", \"biomarker\", \"cancer_dx\",\"country\", \"state\", \"city\"])\\\n", - " .setOutputCol(\"ner_chunk\")\\\n", - " .setSelectionStrategy(\"Sequential\")\n", - "\n", - "# Merge clinical entities (for assertions)\n", - "clinical_merger = medical.ChunkMergeApproach()\\\n", - " .setInputCols([\"drug\", \"biomarker\", \"cancer_dx\"])\\\n", - " .setOutputCol(\"clinical_entities\")\\\n", - " 
.setSelectionStrategy(\"DiverseLonger\")\\\n", - " .setOrderingFeatures([\"ChunkLength\"])\n", - "\n", - "# Assertion models (only for clinical entities)\n", - "contextual_assertion_conditional = medical.ContextualAssertion.pretrained(\"contextual_assertion_conditional\", \"en\", \"clinical/models\")\\\n", - " .setInputCols([\"sentence\", \"token\", \"clinical_entities\"])\\\n", - " .setOutputCol(\"assertion_conditional\")\n", - "\n", - "contextual_assertion_possible = medical.ContextualAssertion.pretrained(\"contextual_assertion_possible\", \"en\", \"clinical/models\")\\\n", - " .setInputCols([\"sentence\", \"token\", \"clinical_entities\"])\\\n", - " .setOutputCol(\"assertion_possible\")\n", - "\n", - "contextual_assertion_someone_else = medical.ContextualAssertion.pretrained(\"contextual_assertion_someone_else\", \"en\", \"clinical/models\")\\\n", - " .setInputCols([\"sentence\", \"token\", \"clinical_entities\"])\\\n", - " .setOutputCol(\"assertion_someone_else\")\n", - "\n", - "contextual_assertion_absent = medical.ContextualAssertion.pretrained(\"contextual_assertion_absent\", \"en\", \"clinical/models\")\\\n", - " .setInputCols([\"sentence\", \"token\", \"clinical_entities\"])\\\n", - " .setOutputCol(\"assertion_absent\")\n", - "\n", - "contextual_assertion_past = medical.ContextualAssertion.pretrained(\"contextual_assertion_past\", \"en\", \"clinical/models\")\\\n", - " .setInputCols([\"sentence\", \"token\", \"clinical_entities\"])\\\n", - " .setOutputCol(\"assertion_past\")\n", - "\n", - "assertion_merger = medical.AssertionMerger()\\\n", - " .setInputCols([\"assertion_conditional\", \"assertion_possible\", \"assertion_someone_else\", \"assertion_absent\", \"assertion_past\"])\\\n", - " .setOutputCol(\"clinical_assertions\")\\\n", - " .setMergeOverlapping(True)\\\n", - " .setSelectionStrategy(\"sequential\")\\\n", - " .setAssertionSourcePrecedence(\"assertion_conditional, assertion_possible, assertion_someone_else, assertion_absent, 
assertion_past\")\\\n", - " .setCaseSensitive(False)\n", - "\n", - "pipeline = nlp.Pipeline(stages=[\n", - " document_assembler,\n", - " sentence_detector,\n", - " tokenizer,\n", - " country_matcher,\n", - " state_matcher,\n", - " city_matcher,\n", - " drug_matcher,\n", - " biomarker_matcher,\n", - " cancer_diagnosis_matcher,\n", - " chunk_merger,\n", - " clinical_merger,\n", - " contextual_assertion_conditional,\n", - " contextual_assertion_possible,\n", - " contextual_assertion_someone_else,\n", - " contextual_assertion_absent,\n", - " contextual_assertion_past,\n", - " assertion_merger\n", - "])\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NTJoOM-hhGNm" - }, - "source": [ - "# Fit the Pipeline" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "DF_wapDLOB6j" - }, - "outputs": [], - "source": [ - "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", - "fitted_pipeline = pipeline.fit(empty_data)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZMS3oL1bhLhz" - }, - "source": [ - "# Sample Text to Tryout the Pipeline" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "q3jvHLA9lcFZ" - }, - "outputs": [], - "source": [ - "sample_texts = [\n", - " \"\"\"Name: Patel, Rina  Record Date: 2095-11-03  MR: 781093\n", - "Dr. 
Sofia Chen, IP: 172.16.254.12\n", - "She is a 48-year-old female admitted to Unity Health Institute in Toronto\n", - "for thyroidectomy on 11/03/95.\n", - "Patient's VIN: JH4KA8270MC012345, SSN: 333-22-7777, Driver’s License: P987654F\n", - "Phone: +1 (647) 555-1122, Address: 789 Queen Street, Toronto, Canada,\n", - "Email: rina.patel@caremail.org\n", - "In the past 18 months, the patient has traveled to India, Germany, Brazil,\n", - "South Korea, Morocco, and Australia for both business and leisure.\n", - "She reported brief stays in Mexico City and Cairo as well.\n", - "All travel occurred prior to surgery, and she denied any symptoms during or after her trips.\"\"\",\n", - "\n", - " \"\"\"Patient Summary Report\n", - "Name: Green, Thomas L.  DOB: 08/14/2040  Sex: Male  MRN: 559882\n", - "Date of Encounter: 2094-08-30  Facility: St. Margaret’s Medical Center, Atlanta, Georgia\n", - "Physician: Dr. Rebecca Allen, MD – Internal Medicine\n", - "Chief Complaint: Persistent abdominal pain and fatigue for 2 weeks.\n", - "History of Present Illness: Mr. Green is a 54-year-old male who presented\n", - "to the emergency department in Atlanta, GA, with abdominal discomfort\n", - "described as a dull ache localized to the left lower quadrant.\n", - "He reports the pain began during a work trip to Texas and progressively\n", - "worsened while traveling through Nevada and Illinois.\n", - "The patient states he had similar episodes in the past during visits to Florida,\n", - "but those resolved spontaneously. He recently returned from a family reunion\n", - "in New York, where he experienced nausea and loss of appetite.\"\"\",\n", - "\n", - " \"\"\"Name: Laura Martinez  Record Date: 2094-06-15  MR: 927384\n", - "Dr. 
Anthony Kim, IP: 10.0.0.45\n", - "She is a 62-year-old female admitted to Metropolitan Medical Center\n", - "in San Francisco for a knee replacement on 06/15/94.\n", - "Patient's VIN: 1N4AL11D75C678901, SSN: 555-66-9999, Driver's license no: D456321K\n", - "Phone: (415) 555-6723, 1122 Pine Avenue, Chicago, IL, USA,\n", - "E-mail: laura.martinez@healthmail.org\n", - "Patient has traveled to Rome, Dubai, and Cape Town in the past 12 months.\"\"\",\n", - "\n", - " \"\"\"Maria’s physician prescribed clopidogrel for her cardiovascular risk,\n", - "along with ibuprofen for muscle pain, azithromycin for her sinus infection,\n", - "and omeprazole to manage her acid reflux on 2024-07-18.\"\"\",\n", - "\n", - " \"\"\"In the bone marrow (BM) aspirate, blasts comprised 91.3% of nucleated cells,\n", - "expressing CD10, CD19, CD34, CD45, CD117, CD123, HLA-DR, and TdT by flow cytometric analysis.\n", - "Serum tumor marker evaluation revealed elevated levels of carcinoembryonic antigen (CEA: 6.42 ng/mL),\n", - "alpha-fetoprotein (AFP: 11.75 ng/mL), and pro-gastrin-releasing peptide (ProGRP: 85.3 pg/mL).\"\"\",\n", - "\n", - " \"\"\"A 65-year-old woman had a history of debulking surgery, bilateral oophorectomy with omentectomy,\n", - "total anterior hysterectomy with radical pelvic lymph nodes dissection due to ovarian carcinoma\n", - "(mucinous-type carcinoma, stage Ic) 1 year ago. The patient's medical compliance was poor and failed\n", - "to complete her chemotherapy (cyclophosphamide 750 mg/m2, carboplatin 300 mg/m2).\n", - "Recently, she noted a palpable right breast mass, 15 cm in size which nearly occupied the whole right breast\n", - "in 2 months. 
Core needle biopsy revealed metaplastic carcinoma.\n", - "Neoadjuvant chemotherapy with the regimens of Taxotere (75 mg/m2), Epirubicin (75 mg/m2),\n", - "and Cyclophosphamide (500 mg/m2) was given for 6 cycles with poor response,\n", - "followed by a modified radical mastectomy (MRM) with dissection of axillary lymph nodes and skin grafting.\n", - "Postoperatively, radiotherapy was done with 5000 cGy in 25 fractions.\n", - "The histopathologic examination revealed a metaplastic carcinoma with squamous differentiation\n", - "associated with adenomyoepithelioma.\n", - "Immunohistochemistry study showed that the tumor cells are positive for epithelial markers\n", - "(cytokeratin AE1/AE3), and myoepithelial markers, including CK 5/6, p63, and S100.\n", - "Expressions of hormone receptors, including ER, PR, and Her-2/Neu, were all negative.\"\"\",\n", - "\n", - " \"\"\"Patient has a family history of diabetes. Father diagnosed with heart failure last year.\n", - "Sister and brother both have asthma. Grandfather had cancer in his late 70s.\n", - "No known family history of substance abuse. Family history of autoimmune diseases is also noted.\"\"\",\n", - "\n", - " \"\"\"Patient resting in bed. Patient given azithromycin without any difficulty.\n", - "Patient has audible wheezing, states chest tightness.\n", - "No evidence of hypertension. Patient denies nausea at this time. Zofran declined.\n", - "Patient is also having intermittent sweating associated with pneumonia.\"\"\",\n", - "\n", - " \"\"\"The patient presents with symptoms suggestive of pneumonia, including fever, productive cough,\n", - "and mild dyspnea. Chest X-ray findings are compatible with a possible early-stage infection,\n", - "though bacterial pneumonia cannot be entirely excluded.\"\"\",\n", - "\n", - " \"\"\"The patient reports intermittent chest pain when engaging in physical activity,\n", - "particularly on exertion. 
Symptoms appear to be contingent upon increased stress levels and heavy meals.\"\"\",\n", - "\n", - " \"\"\"History of Present Illness: The patient reports a history of influenza with high fever\n", - "(up to 41 °C) approximately two months ago. He now presents again with flu-like symptoms,\n", - "including fever, but denies productive cough.\n", - "Family History: Father with a history of lung cancer.\"\"\"\n", - "]\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NydYOOBEhYoF" - }, - "source": [ - "# Apply the Pipeline to Sample Texts" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "9o6SeibHWssx" - }, - "outputs": [], - "source": [ - "data = spark.createDataFrame([[text] for text in sample_texts]).toDF(\"text\")\n", - "result = fitted_pipeline.transform(data)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tvklRLZihkP0" - }, - "source": [ - "# Print the Results for NER" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "laI4cuMZT_5N", - "outputId": "8ecc2e81-83c3-4c73-cef7-8456f5c51f90" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "NER entities\n", - "+-----------------------------+-----+----+---------+\n", - "|result |begin|end |entity |\n", - "+-----------------------------+-----+----+---------+\n", - "|Toronto |155 |161 |City |\n", - "|Toronto |326 |332 |City |\n", - "|Canada |335 |340 |COUNTRY |\n", - "|India |425 |429 |COUNTRY |\n", - "|Germany |432 |438 |COUNTRY |\n", - "|Brazil |441 |446 |COUNTRY |\n", - "|South Korea |449 |459 |COUNTRY |\n", - "|Morocco |462 |468 |COUNTRY |\n", - "|Australia |475 |483 |COUNTRY |\n", - "|Mexico |544 |549 |COUNTRY |\n", - "|Cairo |560 |564 |City |\n", - "|Thomas |36 |41 |DRUG |\n", - "|Atlanta |159 |165 |City |\n", - "|Georgia |168 |174 |COUNTRY |\n", - "|Atlanta |402 |408 |City |\n", - "|Texas |552 |556 |STATE 
|\n", - "|Nevada |609 |614 |STATE |\n", - "|Illinois |620 |627 |STATE |\n", - "|Florida |702 |708 |STATE |\n", - "|reunion |780 |786 |COUNTRY |\n", - "|New York |791 |798 |STATE |\n", - "|Laura |6 |10 |DRUG |\n", - "|San Francisco |160 |172 |City |\n", - "|Chicago |333 |339 |City |\n", - "|USA |346 |348 |COUNTRY |\n", - "|laura |359 |363 |DRUG |\n", - "|Rome |413 |416 |City |\n", - "|Dubai |419 |423 |City |\n", - "|clopidogrel |29 |39 |DRUG |\n", - "|ibuprofen |81 |89 |DRUG |\n", - "|azithromycin |108 |119 |DRUG |\n", - "|omeprazole |150 |159 |DRUG |\n", - "|CD10 |88 |91 |Biomarker|\n", - "|CD19 |94 |97 |Biomarker|\n", - "|CD34 |100 |103 |Biomarker|\n", - "|CD45 |106 |109 |Biomarker|\n", - "|CD117 |112 |116 |Biomarker|\n", - "|CD123 |119 |123 |Biomarker|\n", - "|HLA-DR |126 |131 |Biomarker|\n", - "|TdT |138 |140 |Biomarker|\n", - "|carcinoembryonic antigen |229 |252 |Biomarker|\n", - "|CEA |255 |257 |Biomarker|\n", - "|alpha-fetoprotein |273 |289 |Biomarker|\n", - "|AFP |292 |294 |Biomarker|\n", - "|pro-gastrin-releasing peptide|315 |343 |Biomarker|\n", - "|ProGRP |346 |351 |Biomarker|\n", - "|ovarian carcinoma |175 |191 |Cancer_dx|\n", - "|mucinous-type carcinoma |194 |216 |Cancer_dx|\n", - "|cyclophosphamide |324 |339 |DRUG |\n", - "|carboplatin |352 |362 |DRUG |\n", - "|metaplastic carcinoma |526 |546 |Cancer_dx|\n", - "|Taxotere |595 |602 |DRUG |\n", - "|Epirubicin |616 |625 |DRUG |\n", - "|Cyclophosphamide |643 |658 |DRUG |\n", - "|metaplastic carcinoma |935 |955 |Cancer_dx|\n", - "|adenomyoepithelioma |1003 |1021|Cancer_dx|\n", - "|cytokeratin AE1/AE3 |1116 |1134|Biomarker|\n", - "|myoepithelial markers |1142 |1162|Biomarker|\n", - "|CK 5/6 |1175 |1180|Biomarker|\n", - "|p63 |1183 |1185|Biomarker|\n", - "|S100 |1192 |1195|Biomarker|\n", - "|ER |1242 |1243|Biomarker|\n", - "|PR |1246 |1247|Biomarker|\n", - "|Her-2/Neu |1254 |1262|Biomarker|\n", - "|cancer |142 |147 |Cancer_dx|\n", - "|azithromycin |38 |49 |DRUG |\n", - "|Zofran |194 |199 |DRUG |\n", - "|lung 
cancer |264 |274 |Cancer_dx|\n", - "+-----------------------------+-----+----+---------+\n", - "\n" - ] - } - ], - "source": [ - "# Print results for all NER entities\n", - "print(\"NER entities\")\n", - "result.selectExpr(\"explode(ner_chunk)\").select(\"col.result\", \"col.begin\", \"col.end\", \"col.metadata.entity\").show(100, truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZXdXHRvLhu8J" - }, - "source": [ - "# Print Results for Assertions" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "07E49VsIUXNZ", - "outputId": "c0f7f892-75cd-4875-85d4-3a05b487fd90" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Assertions\n", - "+-----------------------+-----+----+----------------------------+\n", - "|ner_chunk |begin|end |result |\n", - "+-----------------------+-----+----+----------------------------+\n", - "|ovarian carcinoma |175 |191 |Past |\n", - "|mucinous-type carcinoma|194 |216 |Past |\n", - "|cyclophosphamide |324 |339 |Past |\n", - "|carboplatin |352 |362 |Past |\n", - "|Taxotere |595 |602 |Past |\n", - "|Epirubicin |616 |625 |Past |\n", - "|Cyclophosphamide |643 |658 |Past |\n", - "|metaplastic carcinoma |935 |955 |conditional |\n", - "|cytokeratin AE1/AE3 |1116 |1134|Past |\n", - "|myoepithelial markers |1142 |1162|Past |\n", - "|CK 5/6 |1175 |1180|Past |\n", - "|p63 |1183 |1185|Past |\n", - "|S100 |1192 |1195|Past |\n", - "|ER |1242 |1243|absent |\n", - "|PR |1246 |1247|absent |\n", - "|Her-2/Neu |1254 |1262|absent |\n", - "|cancer |142 |147 |associated_with_someone_else|\n", - "|Zofran |194 |199 |absent |\n", - "|lung cancer |264 |274 |associated_with_someone_else|\n", - "+-----------------------+-----+----+----------------------------+\n", - "\n" - ] - } - ], - "source": [ - "# Assertions (only for clinical entities: Drug, Biomarker, Cancer_dx)\n", - "print(\"Assertions\")\n", - 
"result.selectExpr(\"explode(clinical_assertions)\").select(\"col.metadata.ner_chunk\", \"col.begin\", \"col.end\", \"col.result\").show(100, truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "rCPFs8I5mmEy" - }, - "source": [ - "# Entity and Assertion Visualization" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "id": "MYELOqmlgiD9", - "outputId": "2d8b1cc6-dba9-4124-cd6f-025d9d0104c0" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Ner Result Entities \n" - ] - }, - { - "data": { - "text/html": [ - "\n", - "\n", - " Name: Patel, Rina  Record Date: 2095-11-03  MR: 781093
Dr. Sofia Chen, IP: 172.16.254.12
She is a 48-year-old female admitted to Unity Health Institute in
Toronto City
for thyroidectomy on 11/03/95.
Patient's VIN: JH4KA8270MC012345, SSN: 333-22-7777, Driver’s License: P987654F
Phone: +1 (647) 555-1122, Address: 789 Queen Street,
Toronto City, Canada COUNTRY,
Email: rina.patel@caremail.org
In the past 18 months, the patient has traveled to
India COUNTRY, Germany COUNTRY, Brazil COUNTRY,
South Korea COUNTRY, Morocco COUNTRY, and Australia COUNTRY for both business and leisure.
She reported brief stays in
Mexico COUNTRY City and Cairo City as well.
All travel occurred prior to surgery, and she denied any symptoms during or after her trips.
" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "\n", - " Clinical Entities\n" - ] - }, - { - "data": { - "text/html": [ - "\n", - "\n", - " A 65-year-old woman had a history of debulking surgery, bilateral oophorectomy with omentectomy,
total anterior hysterectomy with radical pelvic lymph nodes dissection due to
ovarian carcinoma Cancer_dxPast
(
mucinous-type carcinoma Cancer_dxPast , stage Ic) 1 year ago. The patient's medical compliance was poor and failed
to complete her chemotherapy (
cyclophosphamide DRUGPast 750 mg/m2, carboplatin DRUGPast 300 mg/m2).
Recently, she noted a palpable right breast mass, 15 cm in size which nearly occupied the whole right breast
in 2 months. Core needle biopsy revealed
metaplastic carcinoma Cancer_dx.
Neoadjuvant chemotherapy with the regimens of
Taxotere DRUGPast (75 mg/m2), Epirubicin DRUGPast (75 mg/m2),
and
Cyclophosphamide DRUGPast (500 mg/m2) was given for 6 cycles with poor response,
followed by a modified radical mastectomy (MRM) with dissection of axillary lymph nodes and skin grafting.
Postoperatively, radiotherapy was done with 5000 cGy in 25 fractions.
The histopathologic examination revealed a
metaplastic carcinoma Cancer_dxconditional with squamous differentiation
associated with
adenomyoepithelioma Cancer_dx.
Immunohistochemistry study showed that the tumor cells are positive for epithelial markers
(
cytokeratin AE1/AE3 BiomarkerPast ), and myoepithelial markers BiomarkerPast , including CK 5/6 BiomarkerPast , p63 BiomarkerPast , and S100 BiomarkerPast .
Expressions of hormone receptors, including
ER Biomarkerabsent , PR Biomarkerabsent , and Her-2/Neu Biomarkerabsent , were all negative." - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "# Visualize NER entities (without assertions)\n", - "print(\"Ner Result Entities \")\n", - "nlp.viz.NerVisualizer().display(result.collect()[0], 'ner_chunk')\n", - "\n", - "# Visualize clinical entities with their assertions\n", - "print(\"\\n\\n Clinical Entities\")\n", - "nlp.viz.AssertionVisualizer().display(result.collect()[5], 'clinical_entities', 'clinical_assertions')\n" - ] - } - ], - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - }, - "language_info": { - "name": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -}