feat: add Jupyter and Databricks notebook parsing support#69

Merged
tirth8205 merged 11 commits into tirth8205:main from michael-denyer:feat/notebook-support
Mar 31, 2026
Conversation

@michael-denyer
Contributor

Summary

  • Add .ipynb (Jupyter/Databricks) notebook parsing — extracts functions, classes, imports, and calls from code cells across Python, SQL, R, and Scala kernels
  • Add Databricks .py notebook export parsing — detects # Databricks notebook source header and splits on # COMMAND ---------- markers
  • Extract SQL table references (FROM, JOIN, INTO, CREATE TABLE/VIEW) as import edges for cross-language lineage
  • Shared _parse_notebook_cells method handles multi-language cell dispatch with per-cell line offset tracking
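The SQL table-reference extraction described above can be sketched roughly as follows. This is an illustrative regex over the keywords the summary lists (FROM, JOIN, INTO, CREATE TABLE/VIEW); the PR's actual patterns and helper names may differ.

```python
import re

# Illustrative pattern covering the keywords named in the summary bullet;
# an assumption, not the PR's exact regex.
_TABLE_REF_RE = re.compile(
    r"\b(?:FROM|JOIN|INTO|CREATE\s+(?:OR\s+REPLACE\s+)?(?:TABLE|VIEW))\s+"
    r"([A-Za-z_][\w.]*)",
    re.IGNORECASE,
)

def extract_sql_table_refs(sql: str) -> list[str]:
    """Return distinct table names referenced in a SQL cell, in order."""
    seen: dict[str, None] = {}
    for match in _TABLE_REF_RE.finditer(sql):
        seen.setdefault(match.group(1))
    return list(seen)
```

Each extracted name would then become the target of an import edge, giving cross-language lineage from notebook cells into tables.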

Test plan

  • Jupyter .ipynb parsing with Python kernel cells
  • Databricks multi-language .ipynb with %python, %sql, %r, %scala magic commands
  • Databricks .py export format parsing
  • SQL table regex extraction tests
  • R-kernel notebook cells (xfail pending PR #43, "feat: add R language parsing support")
  • Edge cases: empty notebooks, non-code cells, malformed JSON

Extract code cells from .ipynb files, filter magic/shell commands,
concatenate with offset tracking, and parse as Python via tree-sitter.

Supports:
- Python kernel detection (phase 1)
- Magic command filtering (%pip, !ls)
- Cell index tracking in node.extra["cell_index"]
- Cross-cell function calls and imports
- Edge cases: empty notebooks, non-Python kernels, malformed JSON

Includes test fixture and 12 tests in TestNotebookParsing.
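The extract/filter/concatenate flow above can be sketched as a small helper. This is a hypothetical standalone function mirroring the described behavior (the real logic lives in `_parse_notebook` and `_parse_notebook_cells`); field handling and return shape are assumptions.

```python
import json

def extract_code_cells(ipynb_text: str) -> list[tuple[int, int, str]]:
    """Return (cell_index, line_offset, source) for each code cell.

    Hypothetical sketch of the approach described above, not the PR's
    actual implementation.
    """
    try:
        nb = json.loads(ipynb_text)
    except json.JSONDecodeError:
        return []  # malformed JSON: yield nothing rather than crash

    cells: list[tuple[int, int, str]] = []
    offset = 0
    for index, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # skip markdown/raw cells
        lines = cell.get("source", [])
        if isinstance(lines, str):
            lines = lines.splitlines(keepends=True)
        # Filter magic (%pip ...) and shell (!ls) lines before Python parsing.
        kept = [ln for ln in lines if not ln.lstrip().startswith(("%", "!"))]
        source = "".join(kept)
        cells.append((index, offset, source))
        offset += source.count("\n") + 1  # per-cell line offset tracking
    return cells
```

Keeping the per-cell offset lets tree-sitter positions in the concatenated source be mapped back to the originating cell (`node.extra["cell_index"]`).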

Split _parse_notebook into two methods:
- _parse_notebook: extracts cells from .ipynb JSON, builds list[CellInfo],
  delegates to _parse_notebook_cells
- _parse_notebook_cells: shared method that parses cells grouped by language
  (Python/R via Tree-sitter, SQL via regex)

Also expands supported notebook languages from Python-only to Python and R.
Updates test_non_python_kernel to use an actually unsupported language (Scala)
since R is now supported.
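The language-grouped dispatch in `_parse_notebook_cells` might look something like this sketch. `CellInfo`'s exact fields are assumptions; only the name appears in the commit message above.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CellInfo:
    """Assumed shape of the per-cell record built by _parse_notebook."""
    index: int
    language: str   # "python", "r", or "sql"
    source: str
    line_offset: int

def group_cells_by_language(cells: list[CellInfo]) -> dict[str, list[CellInfo]]:
    """Group cells so each backend runs once: Python/R would go to
    Tree-sitter, SQL to the regex extractor."""
    grouped: dict[str, list[CellInfo]] = defaultdict(list)
    for cell in cells:
        grouped[cell.language].append(cell)
    return dict(grouped)
```

Grouping first keeps the per-language parsers simple: each sees one concatenated stream with offsets, rather than interleaved languages.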

Detect and parse Databricks-exported .py notebooks (identified by the
'# Databricks notebook source' header). Splits on COMMAND delimiters,
classifies cells by MAGIC prefix (%sql, %r, %md, %sh), and delegates
to the existing _parse_notebook_cells shared method. SQL table refs,
Python functions, cross-cell calls, and cell_index tracking all work.
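The header detection and COMMAND/MAGIC splitting described above can be sketched as follows. This is a simplified standalone version (it drops blank lines and does not track offsets); function and constant names are assumptions.

```python
DATABRICKS_HEADER = "# Databricks notebook source"
COMMAND_DELIM = "# COMMAND ----------"

def split_databricks_export(text: str) -> list[tuple[str, str]]:
    """Split a Databricks .py export into (language, source) cells.

    Hypothetical sketch of the detection/classification described above.
    """
    if not text.startswith(DATABRICKS_HEADER):
        return []  # not a Databricks export; caller treats it as plain .py
    cells: list[tuple[str, str]] = []
    for chunk in text.split(COMMAND_DELIM):
        lines = [ln for ln in chunk.splitlines() if ln.strip()]
        if lines and lines[0] == DATABRICKS_HEADER:
            lines = lines[1:]  # drop the header line from the first cell
        if not lines:
            continue
        if lines[0].startswith("# MAGIC %"):
            # Classify by MAGIC prefix, e.g. "%sql" -> "sql", "%md" -> "md".
            magic = lines[0].removeprefix("# MAGIC %").split()[0]
            body = "\n".join(ln.removeprefix("# MAGIC ") for ln in lines[1:])
            cells.append((magic, body))
        else:
            cells.append(("python", "\n".join(lines)))
    return cells
```

The resulting `(language, source)` pairs would then feed the same `_parse_notebook_cells` path as `.ipynb` input.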

@michael-denyer force-pushed the feat/notebook-support branch from ace72bb to 2da0a49 on March 27, 2026
@tirth8205 tirth8205 merged commit ff392a3 into tirth8205:main Mar 31, 2026
1 check passed
loc4atnt pushed a commit to loc4atnt/code-review-graph that referenced this pull request Apr 9, 2026
Co-Authored-By: Michael Denyer <michael-denyer@users.noreply.github.com>
michael-denyer added a commit to michael-denyer/code-review-graph that referenced this pull request Apr 15, 2026

The parser gated CALLS edge emission on `enclosing_func` being set, so
calls made from module scope (top-level script glue, CLI entrypoints,
`if __name__ == "__main__"` blocks, and Jupyter/Databricks notebook
cells) produced zero CALLS edges. Any function invoked only from those
contexts was flagged as dead by `find_dead_code`, even when the
function was the entire reason the script existed.

Notebooks are particularly affected because every cell is module-scope
by definition, so the existing notebook parser (PR tirth8205#69) emitted nodes
and IMPORTS_FROM edges but no CALLS edges — making the dead-code
detector's notebook coverage vacuous.

Fix: when `enclosing_func` is None, attribute the CALLS edge to the
File node instead of dropping it. Matches the existing convention used
by `_extract_value_references` and CONTAINS edges. Applied to all 5
gated emission sites: generic Python/JS/TS path, JSX components,
Elixir, Solidity `emit`, and R.
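A minimal sketch of the fix at each gated emission site, assuming a placeholder graph representation (the project's real node and edge types differ):

```python
def emit_call_edge(graph: list, enclosing_func, file_node, callee) -> None:
    """Attribute module-scope calls to the File node instead of dropping them.

    Sketch of the fix described above; previously the edge was only emitted
    when enclosing_func was set.
    """
    source = enclosing_func if enclosing_func is not None else file_node
    graph.append((source, "CALLS", callee))
```

The key change is the fallback: a call with no enclosing function still produces a CALLS edge, just sourced from the File node, matching the convention already used for CONTAINS edges.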

Downstream: `detect_entry_points` now filters File-sourced CALLS via
`get_all_call_targets(include_file_sources=False)` so script-only
callees remain detectable as entry points (otherwise `run_job()`
called from `script.py` module scope would look "called" by `script.py`
and disappear from flow analysis).
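The downstream filter might be sketched as below. The edge tuple shape here is an assumption made for illustration; only the function name and flag appear in the commit message.

```python
def get_all_call_targets(edges, include_file_sources: bool = True) -> set:
    """Return CALLS targets, optionally excluding edges sourced from File nodes.

    Sketch of the filter described above; edges are modeled as
    (source, kind, is_file_source, target) tuples, a placeholder shape.
    """
    return {
        target
        for source, kind, is_file_source, target in edges
        if kind == "CALLS" and (include_file_sources or not is_file_source)
    }
```

With `include_file_sources=False`, a function called only from module scope has no remaining callers, so entry-point detection can still surface it as a root.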

Verified end-to-end against a Databricks `.ipynb` that calls
`Predict.extract_data_from_sample_ids()` from cell-level code: edge
count went from 0 to 14 CALLS edges, and `find_dead_code` no longer
flags the method.

Tests:
- `test_module_scope_calls_attributed_to_file` — bare `.py` script
- `test_module_scope_calls_in_notebook` — `.ipynb` file
- `test_detect_entry_points_module_scope_caller_is_still_root` — flow
  analysis treats File-sourced CALLS correctly
- `test_module_scope_caller_prevents_dead_code_flag` — end-to-end
  parse → store → find_dead_code
- `test_if_main_block_caller_prevents_dead_code_flag` — same for
  `__main__` block
