
Check for required environment before attempting installation #5609

Merged
rdspring1 merged 42 commits into main from prerequisite-validation
Dec 19, 2025

Conversation

@csarofeen
Collaborator

@csarofeen csarofeen commented Dec 1, 2025

Summary

Validates build prerequisites before CMake runs, replacing cryptic errors with actionable guidance.

Impact: Users get clear installation instructions instead of confusing CMake/linker errors.

What's Validated

Platform
Python 3.8+
CMake 3.18+
Ninja
pybind11[global]>=2.0
PyTorch 2.0+ with CUDA 12.8+
System CUDA toolkit (major version match)
Git submodules initialized
GCC 13+
LLVM 18.1+

Skip option: NVFUSER_BUILD_SKIP_VALIDATION=1 (for CI/custom setups)

Changes by File (Review Order)

Integration & Orchestration

python/setup.py (+20 lines)

  • Calls validate_prerequisites() before build
  • Catches exceptions, exits with error if validation fails
  • Skip option via environment variable
  • Updated header with NVFUSER_BUILD_SKIP_VALIDATION docs
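A minimal sketch of how this gating could look, assuming the validator is passed in as a callable. The helper name `run_validation` and the exact comparison against `"1"` are illustrative assumptions, not the PR's actual code.

```python
import os
import sys


def run_validation(validate, environ=None):
    """Run `validate` unless NVFUSER_BUILD_SKIP_VALIDATION=1 is set.

    Returns the metadata dict from `validate`, or None when skipped.
    Exits with status 1 if validation raises.
    """
    environ = os.environ if environ is None else environ
    if environ.get("NVFUSER_BUILD_SKIP_VALIDATION") == "1":
        print("[nvFuser] Skipping prerequisite validation")
        return None
    try:
        return validate()
    except Exception as exc:  # the PR raises PrerequisiteMissingError here
        print(f"ERROR: {exc}", file=sys.stderr)
        sys.exit(1)
```

In a real `setup.py` the call would presumably be `run_validation(validate_prerequisites)` before the CMake step starts.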

python/tools/prereqs/validate.py (new, 126 lines)

  • Orchestrator: runs all 10 checks in order
  • Returns metadata dict
  • Prints success summary

Dependency Management

requirements.txt (+3 lines)

  • Added cmake>=3.18, version constraints
  • One command to install all build deps: pip install -r requirements.txt

python/pyproject.toml (+1 line)

  • Added pybind11[global]>=2.0 for build isolation

Validation Modules (9 new files, ~1300 lines)

python/tools/prereqs/__init__.py (new, 56 lines)

  • Package initialization, exports all checks

python/tools/prereqs/exceptions.py (new, 23 lines)

  • PrerequisiteMissingError exception type

python/tools/prereqs/platform.py (new, 116 lines)

  • Detects OS, architecture, Ubuntu-based distros

python/tools/prereqs/python_version.py (new, 115 lines)

  • Validates Python 3.8+

python/tools/prereqs/build_tools.py (new, 149 lines)

  • Validates CMake 3.18+, Ninja

python/tools/prereqs/python_packages.py (new, 334 lines)

  • Validates PyTorch 2.0+ with CUDA 12+
  • System CUDA toolkit validation: Detects via nvcc --version, enforces major version match
  • Validates pybind11[global]>=2.0 with CMake support

python/tools/prereqs/git.py (new, 142 lines)

  • Validates git submodules initialized
  • Returns empty list if not in git repo (allows pip install from tarball)
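One way such a check could be sketched, assuming it parses the output of `git submodule status`, where a leading `-` marks a submodule that has not been initialized. The function name and the sample paths in the usage note are hypothetical.

```python
def uninitialized_submodules(status_output):
    """Return submodule paths whose `git submodule status` line starts with '-'."""
    missing = []
    for line in status_output.splitlines():
        if line.startswith("-"):
            # Line format: "-<sha1> <path>"
            parts = line[1:].split()
            if len(parts) >= 2:
                missing.append(parts[1])
    return missing
```

For example, given the two lines `-abc123 third_party/example_a` and ` def456 third_party/example_b (v1.0)`, only `third_party/example_a` would be reported.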

python/tools/prereqs/gcc.py (new, 165 lines)

  • Validates GCC 13+
  • Compile test: Actually tests #include <format> (not just version check)
  • Ubuntu PPA instructions for GCC 13 installation
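A compile test of this kind might look like the sketch below. The function name, the `-std=c++20 -fsyntax-only` flags, and the 60-second timeout are assumptions chosen for illustration; the PR's own test may differ.

```python
import os
import subprocess
import tempfile


def compiler_supports_format(compiler="g++"):
    """Return True if `compiler` can compile a file that includes <format>."""
    source = "#include <format>\nint main() { return 0; }\n"
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "check_format.cpp")
        with open(src, "w") as f:
            f.write(source)
        try:
            result = subprocess.run(
                [compiler, "-std=c++20", "-fsyntax-only", src],
                capture_output=True,
                timeout=60,
            )
        except (OSError, subprocess.TimeoutExpired):
            return False  # compiler missing, not executable, or hung
        return result.returncode == 0
```

Checking the header directly catches setups where the compiler version looks new enough but its standard library is older.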

python/tools/prereqs/llvm.py (new, 249 lines)

  • Validates LLVM 18.1+
  • Priority-based detection: env vars → PATH → system paths → project-local .llvm/
  • No-sudo install option (prebuilt binaries to .llvm/)

CMakeLists.txt

  • Changed LLVM resolution to include project-local .llvm/ install

- Add _detect_system_cuda() helper to detect nvcc version via subprocess
- Validate PyTorch CUDA major version matches system CUDA major version
- Error if major versions don't match (e.g., PyTorch CUDA 13 + system CUDA 12)
- Warn if minor versions don't match (e.g., PyTorch 12.1 + system 12.5)
- Error if nvcc not found with CUDA toolkit install instructions
- Update success message to show both PyTorch and system CUDA versions

Tested scenarios:
- PyTorch CUDA 12.1 vs system 12.5: warning (passes)
- PyTorch CUDA 12.6 vs system 12.5: warning (passes)
- PyTorch CUDA 13.0 vs system 12.5: error (major mismatch)

This completes the PyTorch validation to properly check CUDA compatibility
between PyTorch and the system CUDA toolkit used for building.
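The major/minor policy described above can be sketched as follows. The function name and the string return values are illustrative assumptions; only the error-on-major, warn-on-minor rule comes from the description.

```python
def compare_cuda_versions(torch_cuda, system_cuda):
    """Return 'ok', 'warn' (minor mismatch), or 'error' (major mismatch)."""
    t_major, t_minor = (int(p) for p in torch_cuda.split(".")[:2])
    s_major, s_minor = (int(p) for p in system_cuda.split(".")[:2])
    if t_major != s_major:
        return "error"
    if t_minor != s_minor:
        return "warn"
    return "ok"
```

Applied to the tested scenarios, `("12.1", "12.5")` and `("12.6", "12.5")` yield a warning, and `("13.0", "12.5")` yields an error.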
@csarofeen csarofeen changed the title from "Prerequisite validation" to "Check for required environment before attempting installation" Dec 1, 2025
@github-actions

github-actions bot commented Dec 1, 2025

Review updated until commit e1a390b

Description

  • Add comprehensive prerequisite validation system with 10+ checks before build

  • Validate Python 3.8+, CMake 3.18+, Ninja, pybind11, PyTorch CUDA, GCC 13+, LLVM 18.1+

  • Provide actionable error messages with installation instructions for missing prerequisites

  • Add NVFUSER_BUILD_SKIP_VALIDATION option for CI/custom setups

Changes walkthrough

Relevant files
Enhancement
18 files
  • setup.py: Add prerequisite validation integration with skip option (+23/-0)
  • __init__.py: Create prerequisite validation package with exports (+96/-0)
  • validate.py: Add validation orchestrator for all prerequisite checks (+168/-0)
  • requirements.py: Centralize version requirements and utilities (+269/-0)
  • python_version.py: Add Python 3.8+ version validation with platform guidance (+114/-0)
  • build_tools.py: Validate CMake 3.18+ and Ninja build tools (+140/-0)
  • python_packages.py: Validate pybind11 and PyTorch with CUDA compatibility checks (+375/-0)
  • compiler.py: Validate GCC 13+ or Clang 19+ with format header support (+102/-0)
  • llvm.py: Validate LLVM 18.1+ with project-local installation support (+238/-0)
  • git.py: Check git submodules initialization status (+137/-0)
  • nccl.py: Detect NCCL headers/library for distributed builds (+311/-0)
  • platform.py: Add platform detection for OS, architecture, and distribution (+129/-0)
  • exceptions.py: Define PrerequisiteMissingError exception type (+23/-0)
  • utils.py: Add NCCL include path detection for pip-bundled packages (+34/-0)
  • setup.py: Add deprecation warning and prerequisite validation (+30/-0)
  • CMakeLists.txt: Add NCCL include directory and project-local LLVM detection (+34/-0)
  • pyproject.toml: Add pybind11[global]>=2.0 build requirement (+1/-1)
  • requirements.txt: Add cmake>=3.18 and version constraints for build dependencies (+3/-1)
Documentation
1 file
  • README.md: Update compiler requirements and add validation documentation (+17/-4)

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Missing Error Handling

The validate_prerequisites() function calls multiple validation functions but doesn't handle ImportError gracefully for optional dependencies. If any of the validation modules fail to import (e.g., due to missing dependencies), the entire validation will fail with a cryptic ImportError instead of providing a clear message about what's missing.

def validate_prerequisites() -> Dict[str, Any]:
    """
    Validate all nvFuser build prerequisites in the correct order.

    This function runs all prerequisite checks sequentially and collects
    metadata about the system. If any check fails, it raises PrerequisiteMissingError
    with detailed instructions on how to fix the issue.

    Check order (fail-fast after platform detection):
    1. Platform detection (informational only)
    2. Python
    3. CMake
    4. Ninja
    5. PyTorch with CUDA (includes system CUDA validation)
    6. pybind11
    7. Git submodules initialized
    8. C++ compiler (GCC 13+ or Clang 19+) with <format> header
    9. NCCL headers/library (if distributed enabled)
    10. LLVM

    Returns:
        Dict[str, Any]: Dictionary containing metadata about all detected prerequisites

    Raises:
        PrerequisiteMissingError: If any prerequisite is missing or has wrong version

    Example:
        >>> metadata = validate_prerequisites()
        [nvFuser] Platform: Linux x86_64, Ubuntu 22.04
        [nvFuser] ✓ Python X.Y.Z >= {PYTHON.min_str}
        [nvFuser] ✓ CMake X.Y.Z >= {CMAKE.min_str}
        [nvFuser] ✓ Ninja X.Y.Z (any version)
        [nvFuser] ✓ PyTorch X.Y with CUDA X.Y >= {PYTORCH.min_str} with CUDA {CUDA.min_str}
        [nvFuser] ✓ pybind11 X.Y.Z >= {PYBIND11.min_str} with CMake support
        [nvFuser] ✓ Git submodules: N initialized
        [nvFuser] ✓ GCC X.Y.Z >= {GCC.min_str} with <format> header
        [nvFuser] ✓ NCCL found (headers: /path/to/nccl/include)
        [nvFuser] ✓ LLVM X.Y.Z >= {LLVM.min_str}

        ✓✓✓ All prerequisites validated ✓✓✓

        Note: Version requirements are defined in requirements.py.

        >>> metadata.keys()
        dict_keys(['platform', 'python', 'cmake', 'ninja', 'torch', 'cuda',
                   'pybind11', 'git_submodules', 'compiler', 'nccl', 'llvm'])
    """
Potential Performance Issue

The _get_pip_nccl_paths() function searches through all sys.path entries, which could be slow in environments with many site-packages directories. Consider caching the results or adding early termination once a valid NCCL installation is found.

def _get_pip_nccl_paths() -> Tuple[Optional[Path], Optional[Path]]:
    """
    Find NCCL headers and library from pip-installed nvidia-nccl-cu* package.

    PyTorch's pip package depends on nvidia-nccl-cu* which bundles:
    - {site-packages}/nvidia/nccl/include/nccl.h
    - {site-packages}/nvidia/nccl/lib/libnccl.so.2

    Note: Similar logic exists in utils.py::get_pip_nccl_include_dir() for the
    build system. This function returns both include AND lib paths for complete
    validation, while utils.py only needs the include path for CMake. The
    duplication is intentional to keep validation and build logic independent.

    Returns:
        Tuple of (include_path, lib_path) or (None, None) if not found

    Example:
        >>> inc, lib = _get_pip_nccl_paths()
        >>> inc
        PosixPath('/path/to/site-packages/nvidia/nccl/include')
    """
    # Search all site-packages directories
    for site_path in sys.path:
        if not site_path:
            continue
        nccl_include = Path(site_path) / "nvidia" / "nccl" / "include"
        nccl_lib = Path(site_path) / "nvidia" / "nccl" / "lib"

        header = nccl_include / "nccl.h"
        # Check for versioned library (libnccl.so.2) or unversioned
        lib_exists = (nccl_lib / "libnccl.so.2").exists() or (
            nccl_lib / "libnccl.so"
        ).exists()

        if header.exists() and lib_exists:
            return nccl_include, nccl_lib

    return None, None
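The quoted function already returns at the first complete installation it finds, so the remaining suggestion is caching. A minimal sketch of a cached variant follows, assuming the search paths are passed in as a tuple so they can serve as a cache key; the name `find_nccl_paths` is hypothetical.

```python
import functools
from pathlib import Path
from typing import Optional, Tuple


@functools.lru_cache(maxsize=None)
def find_nccl_paths(
    search_paths: Tuple[str, ...]
) -> Tuple[Optional[Path], Optional[Path]]:
    """Cached variant: scan once per distinct `search_paths` tuple."""
    for site_path in search_paths:
        if not site_path:
            continue
        base = Path(site_path) / "nvidia" / "nccl"
        include, lib = base / "include", base / "lib"
        has_lib = (lib / "libnccl.so.2").exists() or (lib / "libnccl.so").exists()
        if (include / "nccl.h").exists() and has_lib:
            return include, lib  # stop at the first complete install
    return None, None
```

A caller would pass `tuple(sys.path)`; repeated calls with the same tuple are then served from the cache without touching the filesystem.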
Hardcoded Paths

The _find_llvm_config() function has hardcoded system paths and version ranges. This may not work correctly on all Linux distributions or may break when new LLVM versions are released. Consider making these more configurable or using a more robust detection method.

def _find_llvm_config() -> Optional[str]:
    """
    Locate llvm-config binary in order of priority.

    Priority:
    1. LLVM_CONFIG environment variable
    2. LLVM_DIR/bin/llvm-config environment variable (CMake convention)
    3. LLVM_ROOT/bin/llvm-config environment variable
    4. llvm-config on PATH
    5. System known locations
    6. Project-local locations (scanning for compatible versions)

    Returns:
        Optional[str]: Path to llvm-config if found, None otherwise

    Example:
        >>> llvm_config = _find_llvm_config()
        >>> llvm_config
        '/home/user/nvfuser/.llvm/18.1.8/bin/llvm-config'
    """
    candidates = []
    llvm_major = LLVM.min_version[0]  # e.g., 18

    # 1. Explicit LLVM_CONFIG env var
    if llvm_config_env := os.environ.get("LLVM_CONFIG"):
        candidates.append(llvm_config_env)

    # 2. LLVM_DIR (CMake convention)
    # CMake typically sets LLVM_DIR to lib/cmake/llvm or similar
    # Try multiple navigation patterns for robustness
    if llvm_dir := os.environ.get("LLVM_DIR"):
        llvm_dir_path = Path(llvm_dir)
        candidates.append(
            llvm_dir_path / ".." / ".." / ".." / "bin" / "llvm-config"
        )  # lib/cmake/llvm -> root/bin
        candidates.append(
            llvm_dir_path / ".." / ".." / "bin" / "llvm-config"
        )  # cmake/llvm -> root/bin
        candidates.append(
            llvm_dir_path / "bin" / "llvm-config"
        )  # if LLVM_DIR points to root

    # 3. LLVM_ROOT (alternative convention)
    if llvm_root := os.environ.get("LLVM_ROOT"):
        candidates.append(os.path.join(llvm_root, "bin", "llvm-config"))

    # 4. PATH lookup
    if llvm_in_path := shutil.which("llvm-config"):
        candidates.append(llvm_in_path)

    # 5. System known locations (use minimum major version)
    system_paths = [
        f"/usr/lib/llvm-{llvm_major}/bin/llvm-config",
        f"/usr/local/llvm-{llvm_major}/bin/llvm-config",
        "/opt/llvm/bin/llvm-config",
    ]
    candidates.extend(system_paths)

    # 6. Project-local locations (wildcards for minor version variations)
    # Navigate from python/tools/prereqs to repo root (3 levels up)
    repo_root = Path(__file__).resolve().parents[3]
    project_paths = []

    # Check for compatible versions in project locations
    for parent in [repo_root / ".llvm", repo_root / "third_party" / "llvm"]:
        if parent.exists():
            # Scan for compatible versions (minimum and above)
            for major in range(llvm_major, llvm_major + 3):  # e.g., 18, 19, 20
                for child in parent.glob(f"{major}.*"):
                    if child.is_dir():
                        project_paths.append(child / "bin" / "llvm-config")

    candidates.extend([str(p) for p in project_paths])

    # Try each candidate
    for candidate in candidates:
        if candidate:
            candidate_path = Path(candidate)
            if candidate_path.exists() and os.access(candidate_path, os.X_OK):
                return str(candidate_path)

    return None
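One possible response to the hardcoded version range is to accept any candidate whose reported version meets the minimum, instead of scanning a fixed set of major versions. The sketch below relies on `llvm-config --version`; the function names and the default minimum of `(18, 1)` are assumptions for illustration.

```python
import re
import subprocess
from typing import Optional, Tuple


def parse_llvm_version(text: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, patch) from text such as '18.1.8' or '19.1.0git'."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(g) for g in match.groups()) if match else None


def llvm_config_is_compatible(llvm_config: str, minimum=(18, 1)) -> bool:
    """Run `llvm-config --version` and compare the result against `minimum`."""
    try:
        out = subprocess.run(
            [llvm_config, "--version"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # binary missing, not executable, or hung
    if out.returncode != 0:
        return False
    version = parse_llvm_version(out.stdout)
    return version is not None and version[: len(minimum)] >= tuple(minimum)
```

With this in place, the project-local scan could glob every subdirectory and filter by the reported version, so a future major release would not require a code change.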

Test failures

  • (Low, 1) Minor numerical mismatch in Thunder vs Torch instance_norm nvFuser CUDA tests on float32 (dlcluster_h100).

    Test Name H100 Source
    thunder.tests.test_ops.test_core_vs_torch_consistency_instance_norm_nvfuser_cuda_thunder.dtypes.float32

@rdspring1
Collaborator

!build

@csarofeen csarofeen marked this pull request as ready for review December 14, 2025 03:16
@csarofeen
Collaborator Author

!build

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (2)

  1. python/utils.py, line 221 (link)

    logic: eval() on user-provided environment variable is a security risk - enables arbitrary code execution

    Need to import json at top of file

  2. python/utils.py, line 274 (link)

    logic: Same eval() security issue - arbitrary code execution via environment variable
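The flagged lines are not shown in this thread, so the following is only a sketch of the general fix: parse the variable as data with `json.loads` rather than executing it with `eval()`. The helper name `read_list_env` and the assumption that the value is a JSON list are hypothetical.

```python
import json
import os


def read_list_env(name, default=None, environ=None):
    """Parse a JSON list from an environment variable without executing code."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return [] if default is None else default
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"{name} must be a JSON list, got {raw!r}") from exc
    if not isinstance(value, list):
        raise ValueError(f"{name} must be a JSON list, got {raw!r}")
    return value
```

A value such as `__import__("os").system(...)` is rejected with a `ValueError` instead of being run.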

20 files reviewed, 7 comments

Contributor

@greptile-apps greptile-apps bot left a comment

20 files reviewed, 2 comments

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. setup.py, line 5-63 (link)

    style: Missing NVFUSER_BUILD_SKIP_VALIDATION documentation in header comments. Should add after line 63:

    #   NVFUSER_BUILD_SKIP_VALIDATION=1
    #     Skip prerequisite validation (for CI/custom setups)
    

20 files reviewed, 2 comments


Comment on lines +285 to +287
pip_note = ""
pip_inc, pip_lib = _get_pip_nccl_paths()
if pip_inc is None:
Contributor

logic: Logic gap: checks pip_inc but should check if both header and library are missing from pip, since both are required for build success
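As quoted in the reviewer guide, `_get_pip_nccl_paths()` returns either both paths or `(None, None)`, so checking `pip_inc` alone gives the same answer today. An explicit check on both values would still guard against later changes to that function. The helper name below is hypothetical.

```python
def pip_nccl_is_complete(pip_inc, pip_lib):
    """True only when both the header dir and the library dir were found."""
    return pip_inc is not None and pip_lib is not None
```

The condition at the commented line would then read `if not pip_nccl_is_complete(pip_inc, pip_lib):`.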

@rdspring1 rdspring1 self-assigned this Dec 18, 2025
@rdspring1
Collaborator

!test

Contributor

@greptile-apps greptile-apps bot left a comment

20 files reviewed, 1 comment


Comment on lines +110 to +112
except ImportError as e:
# Prerequisite validation not available (shouldn't happen in dev)
print(f"WARNING: Could not import prerequisite validation: {e}")
Contributor

logic: ImportError swallowed with warning defeats PR goal. Per the PR description, this feature exists to "replace cryptic errors with actionable guidance." If the import fails and validation is skipped, users get exactly the cryptic CMake/linker errors this PR aims to prevent.

The ImportError should only occur if tools.prereqs package is missing, which shouldn't happen in a proper git clone. If it does, that's a critical environment issue that should fail the build, not continue silently.

Is this handling intended for pip-installed tarballs where the validation module might be stripped? If so, document why ImportError is acceptable and under what conditions.
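A sketch of the stricter behavior the comment asks for, with an opt-out for the tarball case it raises. The `load_validator` helper and its `strict` flag are hypothetical, and it is an assumption that `tools.prereqs` exposes `validate_prerequisites` at package level.

```python
import importlib
import sys


def load_validator(module_name="tools.prereqs", strict=True):
    """Import the validation entry point; fail loudly when `strict` is set."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        if strict:
            sys.exit(
                f"ERROR: could not import {module_name}: {exc}\n"
                "Set NVFUSER_BUILD_SKIP_VALIDATION=1 to bypass validation."
            )
        # Non-strict mode keeps the warn-and-continue behavior.
        print(f"WARNING: prerequisite validation unavailable: {exc}")
        return None
    return getattr(module, "validate_prerequisites", None)
```

Failing by default keeps a broken checkout from silently falling through to CMake, while `strict=False` documents the one case where skipping is intended.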

@rdspring1 rdspring1 merged commit c27d971 into main Dec 19, 2025
60 of 62 checks passed
@rdspring1 rdspring1 deleted the prerequisite-validation branch December 19, 2025 19:50
@naoyam
Collaborator

naoyam commented Dec 19, 2025

I started seeing this build error:

  ============================================================
  [nvFuser] Validating build prerequisites...
  ============================================================

  ERROR: LLVM not found.

  nvFuser requires LLVM 18.1+ to build (for runtime Host IR JIT).
  llvm-config must be in PATH or at a known location.

  Installation options:

  Option 1: Download prebuilt binaries (recommended, no sudo needed, project-local):
    cd /tmp/nvfuser/tensorindexer_on_by_default  # your nvfuser repo root
    mkdir -p .llvm
    cd .llvm
    wget https://github.com/llvm/llvm-project/releases/download/llvmorg-18.1.8/clang+llvm-18.1.8-x86_64-linux-gnu-ubuntu-18.04.tar.xz
    tar -xf clang+llvm-18.1.8-x86_64-linux-gnu-ubuntu-18.04.tar.xz
    mv clang+llvm-18.1.8-x86_64-linux-gnu-ubuntu-18.04 18.1.8
    # Then set environment variable:
    export LLVM_CONFIG=$(pwd)/18.1.8/bin/llvm-config
    # Install legacy library libtinfo5 if missing
    wget http://mirrors.kernel.org/ubuntu/pool/universe/n/ncurses/libtinfo5_6.3-2ubuntu0.1_amd64.deb
    sudo apt install ./libtinfo5_6.3-2ubuntu0.1_amd64.deb

  Option 2: Install from LLVM APT repository (requires sudo):
    # Install prerequisites
    sudo apt install libzstd1 libzstd-dev lsb-release wget software-properties-common gnupg
    wget https://apt.llvm.org/llvm.sh
    chmod +x llvm.sh
    sudo ./llvm.sh 18
    # llvm-config-18 will be installed at /usr/lib/llvm-18/bin/llvm-config
    export LLVM_CONFIG=/usr/lib/llvm-18/bin/llvm-config

Reverting this PR seems to make the error go away.

@csarofeen
Collaborator Author

@naoyam What system are you on and where is your llvm version installed?

@csarofeen
Collaborator Author

Also, you can disable the checks with: NVFUSER_BUILD_SKIP_VALIDATION=1

@rdspring1 could you please add how to disable the check to all the error messages
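One way to cover every message at once would be to append the hint inside the exception type itself. The class below is a stand-in for the PR's `PrerequisiteMissingError`, whose real definition is not shown here, and the hint wording is an assumption.

```python
SKIP_HINT = "To bypass these checks, set NVFUSER_BUILD_SKIP_VALIDATION=1."


class PrerequisiteMissingError(Exception):
    """Stand-in exception that appends the skip hint to every message."""

    def __init__(self, message):
        super().__init__(f"{message.rstrip()}\n\n{SKIP_HINT}")
```

Each check keeps raising with its own instructions, and the hint appears at the end without editing the individual messages.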

@mdavis36 mdavis36 mentioned this pull request Dec 20, 2025
wujingyue pushed a commit that referenced this pull request Dec 20, 2025
@rdspring1
Collaborator

@wujingyue @mdavis36 @xwang233

I’ll let you take over this PR since you know the build systems better than I do.

mdavis36 added a commit that referenced this pull request Dec 26, 2025
mdavis36 added a commit that referenced this pull request Jan 15, 2026
# Summary

This PR aims to aid new users in setting up and installing nvFuser
successfully. This is done by providing users with a comprehensive Python
report (based on #5609) as early as possible in the build process:
- Clear nvFuser library dependencies & constraints
- Minimum version (when applicable)
- Optional vs Required dependencies
- Actionable output to user on why a requirement is enforced and how to
rectify it.

## Differences (This PR vs #5609)

The outcome of the report is **determined by CMake's evaluation of the
constraints** we place on requirements. The report has **no effect** on
the ability to build nvFuser. All failure logic is defined by the CMake
system.

The report scripts are used to aid in formatting and printing pertinent
information to the user. This is done by directly referencing CMake
variables in python and allowing python to handle complicated string
manipulation and formatting (which CMake is really bad at...).

The contents of the help messages largely remain the same as in #5609,
giving users guidance based on their build platform.

## CMake Changes

- `cmake/DependencyRequirements.cmake` is the single source of truth for
version requirements, components and the state of `OPTIONAL` for each
dependency.
- Option `NVFUSER_ENABLE_DEPENDENCY_REPORT` is `ON` by default. If this
is set to `OFF`, dependencies will be evaluated as "normal" in CMake
and the build configuration will exit on the **first failure**.
- Each requirement's logic is defined in its own
`cmake/deps/handle_<name>.cmake` file for organization and clarity.

### Success Case
- CMake dependency evaluation happens silently and is written to buffer.
- Python report is generated as early as possible.
- **On first run**: CMake will always look for compilers for the
`LANGUAGES` the project is built for first - this can't be skipped AFAIK.
- **On subsequent runs**: the Python report is displayed immediately
(compiler information is cached).
- CMake output is dumped to the user for detailed reporting (this is the
same as when running with `NVFUSER_ENABLE_DEPENDENCY_REPORT=OFF`)
<img width="869" height="1562" alt="image"
src="https://github.com/user-attachments/assets/7c4fddc5-2409-473d-bab9-0203e66fa11c"
/>


### Failure Case (example : pybind11 version too low)
Report fails with installation instructions for users.
- Does not `FATAL_ERROR` when pybind11 mismatches.
- CMake still dumps the output evaluating **ALL** dependencies
- CMake exits after reporting detailed output. 
<img width="869" height="1440" alt="image"
src="https://github.com/user-attachments/assets/9d1c5134-31d8-4050-9c0d-5ae2ad71dc71"
/>

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Jingyue Wu <wujingyue@gmail.com>

6 participants