feat: add Jupyter and Databricks notebook parsing support#69

Merged
tirth8205 merged 11 commits into tirth8205:main from michael-denyer:feat/notebook-support
Mar 31, 2026
Conversation

@michael-denyer
Contributor

Summary

  • Add .ipynb (Jupyter/Databricks) notebook parsing — extracts functions, classes, imports, and calls from code cells across Python, SQL, R, and Scala kernels
  • Add Databricks .py notebook export parsing — detects # Databricks notebook source header and splits on # COMMAND ---------- markers
  • Extract SQL table references (FROM, JOIN, INTO, CREATE TABLE/VIEW) as import edges for cross-language lineage
  • Shared _parse_notebook_cells method handles multi-language cell dispatch with per-cell line offset tracking
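The SQL table-reference extraction described above can be sketched roughly as follows. This is an illustrative regex over the keywords the summary lists (FROM, JOIN, INTO, CREATE TABLE/VIEW); the PR's actual patterns and helper names may differ.

```python
import re

# Illustrative pattern covering the keywords named in the summary bullet;
# an assumption, not the PR's exact regex.
_TABLE_REF_RE = re.compile(
    r"\b(?:FROM|JOIN|INTO|CREATE\s+(?:OR\s+REPLACE\s+)?(?:TABLE|VIEW))\s+"
    r"([A-Za-z_][\w.]*)",
    re.IGNORECASE,
)

def extract_sql_table_refs(sql: str) -> list[str]:
    """Return distinct table names referenced in a SQL cell, in order."""
    seen: dict[str, None] = {}
    for match in _TABLE_REF_RE.finditer(sql):
        seen.setdefault(match.group(1))
    return list(seen)
```

Each extracted name would then become the target of an import edge, giving cross-language lineage from notebook cells into tables.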

Test plan

  • Jupyter .ipynb parsing with Python kernel cells
  • Databricks multi-language .ipynb with %python, %sql, %r, %scala magic commands
  • Databricks .py export format parsing
  • SQL table regex extraction tests
  • R-kernel notebook cells (xfail pending PR #43, "feat: add R language parsing support")
  • Edge cases: empty notebooks, non-code cells, malformed JSON

Extract code cells from .ipynb files, filter magic/shell commands,
concatenate with offset tracking, and parse as Python via tree-sitter.

Supports:
- Python kernel detection (phase 1)
- Magic command filtering (%pip, !ls)
- Cell index tracking in node.extra["cell_index"]
- Cross-cell function calls and imports
- Edge cases: empty notebooks, non-Python kernels, malformed JSON

Includes test fixture and 12 tests in TestNotebookParsing.
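The extract/filter/concatenate flow above can be sketched as a small helper. This is a hypothetical standalone function mirroring the described behavior (the real logic lives in `_parse_notebook` and `_parse_notebook_cells`); field handling and return shape are assumptions.

```python
import json

def extract_code_cells(ipynb_text: str) -> list[tuple[int, int, str]]:
    """Return (cell_index, line_offset, source) for each code cell.

    Hypothetical sketch of the approach described above, not the PR's
    actual implementation.
    """
    try:
        nb = json.loads(ipynb_text)
    except json.JSONDecodeError:
        return []  # malformed JSON: yield nothing rather than crash

    cells: list[tuple[int, int, str]] = []
    offset = 0
    for index, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # skip markdown/raw cells
        lines = cell.get("source", [])
        if isinstance(lines, str):
            lines = lines.splitlines(keepends=True)
        # Filter magic (%pip ...) and shell (!ls) lines before Python parsing.
        kept = [ln for ln in lines if not ln.lstrip().startswith(("%", "!"))]
        source = "".join(kept)
        cells.append((index, offset, source))
        offset += source.count("\n") + 1  # per-cell line offset tracking
    return cells
```

Keeping the per-cell offset lets tree-sitter positions in the concatenated source be mapped back to the originating cell (`node.extra["cell_index"]`).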

Split _parse_notebook into two methods:
- _parse_notebook: extracts cells from .ipynb JSON, builds list[CellInfo],
  delegates to _parse_notebook_cells
- _parse_notebook_cells: shared method that parses cells grouped by language
  (Python/R via Tree-sitter, SQL via regex)

Also expands supported notebook languages from Python-only to Python and R.
Updates test_non_python_kernel to use an actually unsupported language (Scala)
since R is now supported.
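The language-grouped dispatch in `_parse_notebook_cells` might look something like this sketch. `CellInfo`'s exact fields are assumptions; only the name appears in the commit message above.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CellInfo:
    """Assumed shape of the per-cell record built by _parse_notebook."""
    index: int
    language: str   # "python", "r", or "sql"
    source: str
    line_offset: int

def group_cells_by_language(cells: list[CellInfo]) -> dict[str, list[CellInfo]]:
    """Group cells so each backend runs once: Python/R would go to
    Tree-sitter, SQL to the regex extractor."""
    grouped: dict[str, list[CellInfo]] = defaultdict(list)
    for cell in cells:
        grouped[cell.language].append(cell)
    return dict(grouped)
```

Grouping first keeps the per-language parsers simple: each sees one concatenated stream with offsets, rather than interleaved languages.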

Detect and parse Databricks-exported .py notebooks (identified by the
'# Databricks notebook source' header). Splits on COMMAND delimiters,
classifies cells by MAGIC prefix (%sql, %r, %md, %sh), and delegates
to the existing _parse_notebook_cells shared method. SQL table refs,
Python functions, cross-cell calls, and cell_index tracking all work.
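The header detection and COMMAND/MAGIC splitting described above can be sketched as follows. This is a simplified standalone version (it drops blank lines and does not track offsets); function and constant names are assumptions.

```python
DATABRICKS_HEADER = "# Databricks notebook source"
COMMAND_DELIM = "# COMMAND ----------"

def split_databricks_export(text: str) -> list[tuple[str, str]]:
    """Split a Databricks .py export into (language, source) cells.

    Hypothetical sketch of the detection/classification described above.
    """
    if not text.startswith(DATABRICKS_HEADER):
        return []  # not a Databricks export; caller treats it as plain .py
    cells: list[tuple[str, str]] = []
    for chunk in text.split(COMMAND_DELIM):
        lines = [ln for ln in chunk.splitlines() if ln.strip()]
        if lines and lines[0] == DATABRICKS_HEADER:
            lines = lines[1:]  # drop the header line from the first cell
        if not lines:
            continue
        if lines[0].startswith("# MAGIC %"):
            # Classify by MAGIC prefix, e.g. "%sql" -> "sql", "%md" -> "md".
            magic = lines[0].removeprefix("# MAGIC %").split()[0]
            body = "\n".join(ln.removeprefix("# MAGIC ") for ln in lines[1:])
            cells.append((magic, body))
        else:
            cells.append(("python", "\n".join(lines)))
    return cells
```

The resulting `(language, source)` pairs would then feed the same `_parse_notebook_cells` path as `.ipynb` input.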

@michael-denyer force-pushed the feat/notebook-support branch from ace72bb to 2da0a49 on March 27, 2026
@tirth8205 tirth8205 merged commit ff392a3 into tirth8205:main Mar 31, 2026
1 check passed
loc4atnt pushed a commit to loc4atnt/code-review-graph that referenced this pull request Apr 9, 2026
Co-Authored-By: Michael Denyer <michael-denyer@users.noreply.github.com>
michael-denyer added a commit to michael-denyer/code-review-graph that referenced this pull request Apr 15, 2026

The parser gated CALLS edge emission on `enclosing_func` being set, so
calls made from module scope (top-level script glue, CLI entrypoints,
`if __name__ == "__main__"` blocks, and Jupyter/Databricks notebook
cells) produced zero CALLS edges. Any function invoked only from those
contexts was flagged as dead by `find_dead_code`, even when the
function was the entire reason the script existed.

Notebooks are particularly affected because every cell is module-scope
by definition, so the existing notebook parser (PR tirth8205#69) emitted nodes
and IMPORTS_FROM edges but no CALLS edges — making the dead-code
detector's notebook coverage vacuous.

Fix: when `enclosing_func` is None, attribute the CALLS edge to the
File node instead of dropping it. Matches the existing convention used
by `_extract_value_references` and CONTAINS edges. Applied to all 5
gated emission sites: generic Python/JS/TS path, JSX components,
Elixir, Solidity `emit`, and R.
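A minimal sketch of the fix at each gated emission site, assuming a placeholder graph representation (the project's real node and edge types differ):

```python
def emit_call_edge(graph: list, enclosing_func, file_node, callee) -> None:
    """Attribute module-scope calls to the File node instead of dropping them.

    Sketch of the fix described above; previously the edge was only emitted
    when enclosing_func was set.
    """
    source = enclosing_func if enclosing_func is not None else file_node
    graph.append((source, "CALLS", callee))
```

The key change is the fallback: a call with no enclosing function still produces a CALLS edge, just sourced from the File node, matching the convention already used for CONTAINS edges.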

Downstream: `detect_entry_points` now filters File-sourced CALLS via
`get_all_call_targets(include_file_sources=False)` so script-only
callees remain detectable as entry points (otherwise `run_job()`
called from `script.py` module scope would look "called" by `script.py`
and disappear from flow analysis).
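The downstream filter might be sketched as below. The edge tuple shape here is an assumption made for illustration; only the function name and flag appear in the commit message.

```python
def get_all_call_targets(edges, include_file_sources: bool = True) -> set:
    """Return CALLS targets, optionally excluding edges sourced from File nodes.

    Sketch of the filter described above; edges are modeled as
    (source, kind, is_file_source, target) tuples, a placeholder shape.
    """
    return {
        target
        for source, kind, is_file_source, target in edges
        if kind == "CALLS" and (include_file_sources or not is_file_source)
    }
```

With `include_file_sources=False`, a function called only from module scope has no remaining callers, so entry-point detection can still surface it as a root.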

Verified end-to-end against a Databricks `.ipynb` that calls
`Predict.extract_data_from_sample_ids()` from cell-level code: edge
count went from 0 to 14 CALLS edges, and `find_dead_code` no longer
flags the method.

Tests:
- `test_module_scope_calls_attributed_to_file` — bare `.py` script
- `test_module_scope_calls_in_notebook` — `.ipynb` file
- `test_detect_entry_points_module_scope_caller_is_still_root` — flow
  analysis treats File-sourced CALLS correctly
- `test_module_scope_caller_prevents_dead_code_flag` — end-to-end
  parse → store → find_dead_code
- `test_if_main_block_caller_prevents_dead_code_flag` — same for
  `__main__` block
