Add a Google Colab notebook of the protocol

benmwebb · benmwebb · commit 09f940cc170c · 2025-11-12T14:31:42.000-08:00
diff --git a/notebook/ModLoop.ipynb b/notebook/ModLoop.ipynb
@@ -0,0 +1,323 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# ModLoop\n",
+        "\n",
+        "ModLoop is a protocol for automated modeling of loops in protein structures. It relies on the loop modeling routine in [MODELLER](https://salilab.org/modeller/), which predicts loop conformations by satisfaction of spatial restraints, without relying on a database of known protein structures.\n",
+        "\n",
+        "The ModLoop protocol can be run in Google Colab. The first step is to install the Modeller software by running the cell below:"
+      ],
+      "metadata": {
+        "id": "NtLLmo9CCCOA"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "5uAWPuZzXlUq"
+      },
+      "outputs": [],
+      "source": [
+        "# Install Modeller from Sali lab website\n",
+        "modver = \"10.8\"\n",
+        "!wget \"https://salilab.org/modeller/{modver}/modeller_{modver}-1_amd64.deb\"\n",
+        "!apt install \"./modeller_{modver}-1_amd64.deb\"\n",
+        "!rm \"modeller_{modver}-1_amd64.deb\"\n",
+        "# Add Modeller Python modules to Colab's Python path\n",
+        "import sys\n",
+        "sys.path.append(\"/usr/lib/python3.9/dist-packages\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Next, configure the protocol by running the following cell and giving the [MODELLER license key](https://modbase.compbio.ucsf.edu/modloop/help#modkey), uploading the [starting structure](https://modbase.compbio.ucsf.edu/modloop/help#file) in PDB or mmCIF format, and specifying the [loops to refine](https://modbase.compbio.ucsf.edu/modloop/help#loop):"
+      ],
+      "metadata": {
+        "id": "8G2BFSeTDSy9"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# @title\n",
+        "import ipywidgets as widgets\n",
+        "from ipywidgets import GridspecLayout\n",
+        "grid = GridspecLayout(3, 2)\n",
+        "grid[0, 0] = widgets.Label(\"Modeller license key\")\n",
+        "grid[0, 1] = key = widgets.Text()\n",
+        "grid[1, 0] = widgets.Label(\"Upload coordinate file\")\n",
+        "grid[1, 1] = coord = widgets.FileUpload(multiple=False)\n",
+        "grid[2, 0] = widgets.Label(\"Enter loop segments\")\n",
+        "grid[2, 1] = loops = widgets.Textarea()\n",
+        "grid"
+      ],
+      "metadata": {
+        "cellView": "form",
+        "id": "-nW-3oUNYadt"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We use the input data to set up the protocol:"
+      ],
+      "metadata": {
+        "id": "1atrM6BBEt6o"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Add Modeller license key\n",
+        "with open(f\"/usr/lib/modeller{modver}/modlib/modeller/config.py\") as fh:\n",
+        "    inst_dir = fh.readline()\n",
+        "with open(f\"/usr/lib/modeller{modver}/modlib/modeller/config.py\", \"w\") as fh:\n",
+        "    fh.write(inst_dir)\n",
+        "    fh.write(f'license = {key.value!r}\\n')\n",
+        "\n",
+        "# Save uploaded file to local disk\n",
+        "in_fname = list(coord.value.keys())[0]\n",
+        "with open(in_fname, 'wb') as fh:\n",
+        "    fh.write(coord.value[in_fname]['content'])\n",
+        "\n",
+        "def parse_loop_selection(loops):\n",
+        "    \"\"\"Split out loop selection and check it\"\"\"\n",
+        "    import re\n",
+        "    # capitalize and remove spaces\n",
+        "    loops = re.sub(r'\\s+', '', loops.upper())\n",
+        "    # replace null chain IDs with a single space\n",
+        "    loops = loops.replace(\"::\", \": :\")\n",
+        "\n",
+        "    loop_data = loops.split(\":\")[:-1]\n",
+        "\n",
+        "    # Make sure correct number of colons were given\n",
+        "    if len(loop_data) % 4 != 0:\n",
+        "        raise ValueError(\n",
+        "            \"Syntax error in loop selection: check to make sure you \"\n",
+        "            \"have colons in the correct place (there should be a \"\n",
+        "            \"multiple of 4 colons)\")\n",
+        "\n",
+        "    total_res = 0\n",
+        "    start_res = []\n",
+        "    start_id = []\n",
+        "    end_res = []\n",
+        "    end_id = []\n",
+        "    loops = 0\n",
+        "    while loops*4+3 < len(loop_data) and loop_data[loops*4] != \"\":\n",
+        "        try:\n",
+        "            start_res.append(int(loop_data[loops*4]))\n",
+        "            end_res.append(int(loop_data[loops*4+2]))\n",
+        "        except ValueError:\n",
+        "            raise ValueError(\n",
+        "                \"Residue indices are not numeric\")\n",
+        "        start_id.append(loop_data[loops*4+1])\n",
+        "        end_id.append(loop_data[loops*4+3])\n",
+        "        # all the selected residues\n",
+        "        total_res += (end_res[-1] - start_res[-1] + 1)\n",
+        "\n",
+        "        ################################\n",
+        "        # reject loops that are too long, reversed, or span two chains\n",
+        "        if ((end_res[-1] - start_res[-1]) > 20\n",
+        "                or start_id[-1] != end_id[-1]\n",
+        "                or (end_res[-1] - start_res[-1]) < 0):\n",
+        "            raise ValueError(\n",
+        "                \"The loop selected is too long (>20 residues) or \"\n",
+        "                \"shorter than 1 residue or not selected properly \"\n",
+        "                \"(syntax problem?) \"\n",
+        "                \"starting position: %d:%s, ending position: %d:%s\"\n",
+        "                % (start_res[-1], start_id[-1], end_res[-1], end_id[-1]))\n",
+        "        loops += 1\n",
+        "\n",
+        "    ################################\n",
+        "    # too many or no residues rejected\n",
+        "    if total_res > 20:\n",
+        "        raise ValueError(\n",
+        "            \"Too many loop residues have been selected \"\n",
+        "            \"(selected: %d > limit: 20)!\" % total_res)\n",
+        "    if total_res <= 0:\n",
+        "        raise ValueError(\n",
+        "            \"No loop residues selected!\")\n",
+        "    return loop_data\n",
+        "\n",
+        "def get_output_header(loop_data, nmodel):\n",
+        "    \"\"\"Return a suitable header for output model files\"\"\"\n",
+        "    residue_range = []\n",
+        "    for i in range(0, len(loop_data), 4):\n",
+        "        residue_range.append(\"   %s:%s-%s:%s\" % tuple(loop_data[i:i + 4]))\n",
+        "    looplist = \"\\n\".join(residue_range)\n",
+        "    return f\"\"\"\n",
+        "Dear User,\n",
+        "\n",
+        "Coordinates for the lowest energy model (out of {nmodel} sampled)\n",
+        "are returned with the optimized loop regions, listed below:\n",
+        "{looplist}\n",
+        "\n",
+        "For reference, please cite these two articles:\n",
+        "\n",
+        "   A Fiser, RKG Do and A Sali,\n",
+        "   Modeling of loops in protein structures\n",
+        "   Prot. Sci. (2000) 9, 1753-1773\n",
+        "\n",
+        "   A Fiser and A Sali,\n",
+        "   ModLoop: Automated modeling of loops in protein structures\n",
+        "   Bioinformatics. (2003) 19(18) 2500-01\n",
+        "\n",
+        "\n",
+        "For further inquiries, please contact: modloop@ucsf.edu\n",
+        "\n",
+        "with best regards,\n",
+        "Andras Fiser\n",
+        "\n",
+        "\n",
+        "\"\"\"\n",
+        "\n",
+        "def add_loop_header(model, loop_data, nmodel):\n",
+        "    \"\"\"Add a header to the given model PDB or mmCIF file\"\"\"\n",
+        "    with open(model) as fin:\n",
+        "        contents = fin.read()\n",
+        "    prefix = '#' if model.endswith('.cif') else 'REMARK'\n",
+        "    with open(model, 'w') as fout:\n",
+        "        for line in get_output_header(loop_data, nmodel).split('\\n'):\n",
+        "            if line == '':\n",
+        "                fout.write(prefix + '\\n')\n",
+        "            else:\n",
+        "                fout.write(f'{prefix}     {line}\\n')\n",
+        "        fout.write(contents)\n",
+        "\n",
+        "# Check the provided set of loop residues to refine\n",
+        "loop_data = parse_loop_selection(loops.value)"
+      ],
+      "metadata": {
+        "id": "SMGFauYYZlu6"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The loop modeling protocol itself is just a short Python script that runs MODELLER on the input file uploaded earlier, selecting the loop residues given:"
+      ],
+      "metadata": {
+        "id": "hek0hWnsFJgR"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from modeller import Environ, Selection, ModellerError\n",
+        "from modeller.automodel import LoopModel, refine\n",
+        "import sys\n",
+        "\n",
+        "class MyLoop(LoopModel):\n",
+        "    def select_loop_atoms(self):\n",
+        "        rngs = []\n",
+        "        for i in range(0, len(loop_data), 4):\n",
+        "            rngs.append(self.residue_range(\"%s:%s\" % tuple(loop_data[i:i+2]),\n",
+        "                                           \"%s:%s\" % tuple(loop_data[i+2:i+4])))\n",
+        "            if len(rngs[-1]) > 30:\n",
+        "                raise ModellerError(\"loop too long\")\n",
+        "        s = Selection(rngs)\n",
+        "        if len(s.only_no_topology()) > 0:\n",
+        "            raise ModellerError(\"some selected residues have no topology\")\n",
+        "        return s\n",
+        "\n",
+        "def make_loop(taskid):\n",
+        "    logfile = f'{taskid}.log'\n",
+        "    print(f'Logging output to {logfile}')\n",
+        "    old_sys_stdout = sys.stdout\n",
+        "    try:\n",
+        "        sys.stdout = open(logfile, 'w')\n",
+        "        env = Environ(rand_seed=-1000-taskid)\n",
+        "        m = MyLoop(env, inimodel=in_fname, sequence='loop')\n",
+        "        if in_fname.endswith('.cif'):\n",
+        "            m.set_output_model_format('MMCIF')\n",
+        "        else:\n",
+        "            m.set_output_model_format('PDB')\n",
+        "        m.loop.md_level = refine.slow\n",
+        "        m.loop.starting_model = m.loop.ending_model = taskid\n",
+        "        m.make()\n",
+        "        return m.loop.outputs[0]\n",
+        "    finally:\n",
+        "        sys.stdout = old_sys_stdout"
+      ],
+      "metadata": {
+        "id": "gCfs5jsRhXPo"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Finally, we can run the protocol in parallel. The same protocol is run 300 times on the same inputs, each time with a different random seed. (This will run faster on a CPU with more cores.) The single model with the lowest value of the MODELLER objective function (molecular PDF, molpdf) is then selected."
+      ],
+      "metadata": {
+        "id": "4n58sMJMFVtL"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import multiprocessing\n",
+        "import operator\n",
+        "\n",
+        "nmodel = 300\n",
+        "with multiprocessing.Pool() as pool:\n",
+        "    best_model = min(pool.imap_unordered(make_loop, range(1, nmodel+1)),\n",
+        "                     key=operator.itemgetter('molpdf'))\n",
+        "print(f\"Best model is {best_model['name']}\")\n",
+        "\n",
+        "# Add an informative ModLoop header to the best model PDB/mmCIF file\n",
+        "add_loop_header(best_model['name'], loop_data, nmodel)"
+      ],
+      "metadata": {
+        "collapsed": true,
+        "id": "6lejFefP0w-H"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Run the cell below to download the selected structure:"
+      ],
+      "metadata": {
+        "id": "0vpeZJEBGHtu"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from google.colab import files\n",
+        "files.download(best_model['name'])\n"
+      ],
+      "metadata": {
+        "id": "9_sIwcR69lww"
+      },
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}
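
The notebook's `parse_loop_selection` expects the loop segments as colon-terminated fields of the form `start:chain:end:chain:`, one group of four per loop. The sketch below shows that format in isolation. It mirrors the normalisation steps in the notebook (whitespace removed, upper-cased, an empty chain ID turned into a single space) but omits the length limits; the helper name `split_loop_selection` is hypothetical and not part of the notebook.

```python
def split_loop_selection(text):
    """Return (start, chain, end, chain) tuples from a selection string.

    Hypothetical helper for illustration only; the notebook itself
    uses parse_loop_selection, which also enforces length limits.
    """
    # Drop all whitespace, upper-case, and replace an empty chain ID
    # ("::") with a single space, as the notebook does.
    cleaned = "".join(text.split()).upper().replace("::", ": :")
    # Every field ends with a colon, so the last split item is empty.
    fields = cleaned.split(":")[:-1]
    if len(fields) % 4 != 0:
        raise ValueError("expected a multiple of 4 colon-terminated fields")
    return [(int(fields[i]), fields[i + 1], int(fields[i + 2]), fields[i + 3])
            for i in range(0, len(fields), 4)]


# Two loops, on chains A and B, entered on separate lines
segments = split_loop_selection("1:A:10:A:\n25:b:30:b:")
print(segments)
```

A structure with no chain ID would be entered as, for example, `5::9::`, which this sketch reads as residues 5 to 9 on chain `' '`.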