Skip to content
This repository was archived by the owner on Jul 31, 2025. It is now read-only.
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .github/workflows/main.yml
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@ jobs:
pytest --nbmake -n=auto --nbmake-kernel=smoketests --nbmake-timeout=1800 ./notebooks/introductory/basic/2*.ipynb
pytest --nbmake -n=auto --nbmake-kernel=smoketests --nbmake-timeout=1800 ./notebooks/introductory/basic/3*.ipynb
pytest --nbmake -n=auto --nbmake-kernel=smoketests --nbmake-timeout=1800 ./notebooks/introductory/meta/*.ipynb
pytest --nbmake -n=auto --nbmake-kernel=smoketests --nbmake-timeout=1800 ./notebooks/introductory/custom/*.ipynb

migrate-and-advanved:

Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,270 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a99ddc26",
"metadata": {},
"source": [
"# Two-step linker\n",
"\n",
"The two-step linker is a linker that attempt to learn and use type information for further context.\n",
"It essentially uses the same context manager as the regular linker, but only for types.\n",
"Since there's generally far fewer types, and they're all quite different, classification between them should be easier.\n",
"This should help with misidentifying wrong types of concepts that share the same name (e.g _infusion (procedure)_ vs _infusion (method of drug admiinstartion)_).\n",
"\n",
"Normally, changing the linker would mean you'd retrain the model.\n",
"That's because the new linker needs to be retrained.\n",
"However, if the train counts are correct, we should be able to infer the per-type traininng from existing per-concept trainining."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "4a571a00",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Existing concept (and types), their names, and corrresponding training:\n",
"73211009 (set()) \t Diabetes Mellitus Diagnosed \t 2\n",
"396230008 (set()) \t Wagner Unverricht Syndrome \t 1\n",
"44132006 (set()) \t Abscess \t 2\n",
"128477000 (set()) \t Abscesses \t 2\n"
]
}
],
"source": [
"# let's reuse the supervised trained model, but we'll change the linker\n",
"from medcat.cat import CAT\n",
"cat = CAT.load_model_pack(\"../basic/models/sup_trained_model.zip\")\n",
"print(\"Existing concept (and types), their names, and corrresponding training:\")\n",
"for tui, ci in cat.cdb.cui2info.items():\n",
" print(f\"{tui:16s} ({ci['type_ids']}) \\t {cat.cdb.get_name(tui):24s} \\t {ci['count_train']}\")"
]
},
{
"cell_type": "markdown",
"id": "20905c11",
"metadata": {},
"source": [
"We notice that the current concepts don't have type IDs.\n",
"So we have to add them before we can use 2-step linking.\n",
"Let's just devide them into two types:\n",
"- Ones that start with A\n",
"- Ones that don't start with A\n",
"This is an arbitrary distinction, but should work for our example."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "68594a45",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Existing concept (and types), their names, and corrresponding training:\n",
"73211009 ({'NA'}) \t Diabetes Mellitus Diagnosed \t 2\n",
"396230008 ({'NA'}) \t Wagner Unverricht Syndrome \t 1\n",
"44132006 ({'SA'}) \t Abscess \t 2\n",
"128477000 ({'SA'}) \t Abscesses \t 2\n"
]
}
],
"source": [
"from medcat.cdb.concepts import TypeInfo\n",
"cat.cdb.type_id2info.update({\n",
" \"SA\": TypeInfo(\n",
" type_id=\"SA\",\n",
" name=\"Starts with A\",\n",
" cuis=[cui for cui in cat.cdb.cui2info if cat.cdb.get_name(cui).startswith(\"A\")],\n",
" ),\n",
" \"NA\": TypeInfo(\n",
" type_id=\"NA\",\n",
" name=\"Does not start with A\",\n",
" cuis=[cui for cui in cat.cdb.cui2info if not cat.cdb.get_name(cui).startswith(\"A\")],\n",
" ),\n",
"})\n",
"# NOTE: we also need to update the concept info, so that the new types are included\n",
"for tui, ci in cat.cdb.cui2info.items():\n",
" for ti in cat.cdb.type_id2info.values():\n",
" if tui in ti.cuis:\n",
" ci[\"type_ids\"].add(ti.type_id)\n",
"\n",
"print(\"Existing concept (and types), their names, and corrresponding training:\")\n",
"for tui, ci in cat.cdb.cui2info.items():\n",
" print(f\"{tui:16s} ({ci['type_ids']}) \\t {cat.cdb.get_name(tui):24s} \\t {ci['count_train']}\")"
]
},
{
"cell_type": "markdown",
"id": "c0a19671",
"metadata": {},
"source": [
"Now we've got the necessary information to use a 2-step linker.\n",
"We just need to get the context vectors from the existing training."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6cf1577a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Existing concept (and types), their names, and corrresponding training:\n",
"73211009 ({'NA'}) \t Diabetes Mellitus Diagnosed \t 2\n",
"396230008 ({'NA'}) \t Wagner Unverricht Syndrome \t 1\n",
"44132006 ({'SA'}) \t Abscess \t 2\n",
"128477000 ({'SA'}) \t Abscesses \t 2\n",
"TYPE_ID:SA (set()) \t Starts with A \t 4\n",
"TYPE_ID:NA (set()) \t Does not start with A \t 3\n"
]
}
],
"source": [
"import numpy as np\n",
"from medcat.components.linking.two_step_context_based_linker import TYPE_ID_PREFIX\n",
"# change linking components\n",
"cat.config.components.linking.comp_name = 'medcat2_two_step_linker'\n",
"# recreate pipe with 2-step linker, this won't be automatic\n",
"cat._recreate_pipe()\n",
"# update TypeID counts and context vectors\n",
"def get_context_vectors(tui: str) -> tuple[dict[str, np.ndarray | None], int]:\n",
" cvs = [((ci := cat.cdb.cui2info[cui])['context_vectors'], ci['count_train'])\n",
" for cui in cat.cdb.type_id2info[tui].cuis]\n",
" # print(\"FIRST CSV\", cvs[0][0]['xlong'].shape)\n",
" vec_types = set(\n",
" vt\n",
" for ccvs, _ in cvs if ccvs\n",
" for vt in ccvs)\n",
" out_vec: dict[str, np.ndarray | None] = {}\n",
" max_train = 0\n",
" for vtype in vec_types:\n",
" vecs: list[np.ndarray] = []\n",
" counts: list[int] = []\n",
" for ccvs, count in cvs:\n",
" if vtype in ccvs:\n",
" vecs.append(ccvs[vtype])\n",
" counts.append(count)\n",
" if vecs and counts:\n",
" out_vec[vtype] = np.sum([c * v for c, v in zip(counts, vecs)], axis=0) / sum(counts)\n",
" max_train = max(max_train, sum(counts))\n",
" return out_vec, max_train\n",
"\n",
"for tui, type_ci in cat.cdb.cui2info.items():\n",
" if not tui.startswith(TYPE_ID_PREFIX):\n",
" # ignore actual/regular CUIs\n",
" continue\n",
" type_ci['context_vectors'], type_ci['count_train'] = get_context_vectors(tui.removeprefix(TYPE_ID_PREFIX))\n",
" # print(\"Set CV\", type_ci['context_vectors'], type_ci['context_vectors']['xlong'].shape)\n",
"print(\"Existing concept (and types), their names, and corrresponding training:\")\n",
"for tui, ci in cat.cdb.cui2info.items():\n",
" print(f\"{tui:16s} ({ci['type_ids']}) \\t {cat.cdb.get_name(tui):24s} \\t {ci['count_train']}\")"
]
},
{
"cell_type": "markdown",
"id": "e54063d6",
"metadata": {},
"source": [
"NB: Ideally this is where you would start if you had done the entire process of model creation and training (e.g basics/1 through 3) with a 2-step linker approach.\n",
"\n",
"Now let's explore the model's output.\n",
"To tell the difference (since it's all happening under the hood), we need to enable logging."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c4febc81",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Narrowing down candidates for: 'abscess' from ['44132006', '128477000']\n",
"Adding per CUI to 128477000 (tokens 5..5) weights {'44132006': 0.7396468208767506, '128477000': 0.7396468208767506}\n",
"Linker started with entity: abscess\n",
"Mixing type similarity of 0.9929 and CUI similarity of 0.7396 with 0.7682 weight for CUI similarity\n",
"[Per CUI weights] CUI: 44132006, Name: abscess, Old sim: 0.993, New sim: 0.798\n",
"Mixing type similarity of 0.1344 and CUI similarity of 0.7396 with 0.7682 weight for CUI similarity\n",
"[Per CUI weights] CUI: 128477000, Name: abscess, Old sim: 0.134, New sim: 0.599\n",
"Considering CUI 44132006 with sim 0.798340\n",
"Narrowing down candidates for: 'abscess' from ['44132006', '128477000']\n",
"Adding per CUI to 128477000 (tokens 1..1) weights {'44132006': 0.4386427664494404, '128477000': 0.4386427664494404}\n",
"Linker started with entity: abscess\n",
"Mixing type similarity of 0.2137 and CUI similarity of 0.4386 with 0.4239 weight for CUI similarity\n",
"[Per CUI weights] CUI: 44132006, Name: abscess, Old sim: 0.214, New sim: 0.309\n",
"Mixing type similarity of 0.4625 and CUI similarity of 0.4386 with 0.4239 weight for CUI similarity\n",
"[Per CUI weights] CUI: 128477000, Name: abscess, Old sim: 0.462, New sim: 0.452\n",
"Considering CUI 128477000 with sim 0.452367\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 : {0: {'pretty_name': 'Abscess', 'cui': '44132006', 'type_ids': ['SA'], 'source_value': 'abscess', 'detected_name': 'abscess', 'acc': 0.7983395419982506, 'context_similarity': 0.7983395419982506, 'start': 43, 'end': 50, 'id': 0, 'meta_anns': {}, 'context_left': [], 'context_center': [], 'context_right': []}}\n",
"1 : {0: {'pretty_name': 'Abscesses', 'cui': '128477000', 'type_ids': ['SA'], 'source_value': 'abscess', 'detected_name': 'abscess', 'acc': 0.45236695339580785, 'context_similarity': 0.45236695339580785, 'start': 3, 'end': 10, 'id': 0, 'meta_anns': {}, 'context_left': [], 'context_center': [], 'context_right': []}}\n"
]
}
],
"source": [
"abscess_text_morph = \"\"\"Histopathology reveals a well-encapsulated abscess with central necrosis and neutrophilic infiltration.\"\"\"\n",
"abscess_text_disorder = \"\"\"An abscess is a disorder, which is a clinical condition characterized by the formation of a painful and inflamed mass containing purulent material\"\"\"\n",
"\n",
"# enable logging\n",
"import logging\n",
"from medcat.components.linking.two_step_context_based_linker import logger as tsl_l\n",
"from medcat.components.linking.context_based_linker import logger as cbl_l\n",
"sh = logging.StreamHandler()\n",
"for l in [tsl_l, cbl_l]:\n",
" if not l.handlers:\n",
" l.addHandler(sh)\n",
" l.setLevel(logging.DEBUG)\n",
"\n",
"for text_num, text in enumerate([abscess_text_morph, abscess_text_disorder]):\n",
" ents = cat.get_entities(text)['entities']\n",
" print(text_num, \":\", ents)"
]
},
{
"cell_type": "markdown",
"id": "70ac5fb9",
"metadata": {},
"source": [
"As we saw above in the log, some extra steps were done to mix in the type context to the CUI context."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
2 changes: 2 additions & 0 deletions notebooks/introductory/custom/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
This section uses more custom (non-default) components.
But the components are still a part of the core release.