Skip to content

Add get_feature_names#4

Open
marcomq wants to merge 4 commits intoaryehlev:publishfrom
marcomq:publish
Open

Add get_feature_names#4
marcomq wants to merge 4 commits intoaryehlev:publishfrom
marcomq:publish

Conversation

@marcomq
Copy link

@marcomq marcomq commented Dec 1, 2025

  • adding rust functions to get feature names
    • get_feature_names
    • get_float_feature_names
    • get_cat_feature_names
    • get_text_feature_names
    • get_embedding_feature_names

I already created a PR in the original catboost repo. But I also found this code here and not sure yet which crate I should use.

Summary by CodeRabbit

  • New Features

    • Added methods to retrieve feature names by type: float, categorical, text, and embedding.
    • Added a method to retrieve all feature names from a model.
  • Bug Fixes

    • Improved stability when retrieving feature name lists; no changes to prediction behavior.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Dec 1, 2025

Walkthrough

Added libc runtime dependency and multiple FFI helpers plus five public APIs in src/model.rs to read feature names from the C model; the helper/public-api block was accidentally inserted twice in the file.

Changes

Cohort / File(s) Summary
Dependency Management
Cargo.toml
Added runtime dependency libc = "0.2.178".
Feature Name Retrieval APIs
src/model.rs
Added public methods: get_float_feature_names(), get_cat_feature_names(), get_text_feature_names(), get_embedding_feature_names(), and get_feature_names() to expose feature-name retrieval from the native model.
Internal Conversion & FFI Helpers
src/model.rs
Added unsafe FFI helpers to convert C-allocated buffers/strings into Rust Vec/String and free C memory: from_c_allocated_buffer<T>, get_feature_indices_from_c(), get_feature_names_from_c(), get_specific_feature_names(), get_feature_indices(), and a driver get_feature_names().
Duplication Notice
src/model.rs
The helper and public-API block is duplicated within the file, producing repeated declarations that should be reconciled.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Focus review on unsafe FFI logic, pointer/null checks, and ownership transfer between C and Rust.
  • Verify correct freeing order for inner C strings and outer arrays to avoid leaks or double-free.
  • Remove or reconcile duplicated blocks to prevent duplicate-symbols/duplicate-definition compilation or API export issues.
  • Confirm feature-flag gating (catboost_feature_indices) is applied consistently around new public APIs.

Possibly related PRs

  • support all catboost versions. #2 — Adds/emits the catboost_feature_indices feature-flag infrastructure that the new APIs are gated by; directly related to build/feature gating.

Poem

🐰 I hopped through pointers by moonlit streams,
I freed C crumbs and harvested names in dreams,
Five lists I gathered, tidy and bright,
Libc lent a hand through the quiet night,
Hooray — the model's labels now in sight! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Add get_feature_names' directly summarizes the main change: adding feature name retrieval APIs to the codebase.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/model.rs (2)

325-337: Guard assumptions when mapping feature indices to names

The pattern of:

  • calling get_feature_names() to get all_names, then
  • calling get_feature_indices(...) and indexing all_names[i]

is clean, but it rests entirely on the C API guaranteeing that:

  • every index returned by indices_fn is < all_names.len(), and
  • the index space used by Get*FeatureIndices matches that of GetModelUsedFeaturesNames.

If that contract is ever violated (version skew, bug, or undocumented change), all_names[i] will panic or read garbage.

Suggested follow‑ups:

  • Add a short debug‑only sanity check before indexing, e.g. debug_assert!(i < all_names.len());.
  • Document in the get_specific_feature_names doc comment that it assumes indices are in bounds for the array returned by GetModelUsedFeaturesNames.

The rest of the FFI wrapper (indices_fn signature, indices_ptr/count handling) looks consistent with the surrounding code.

Also applies to: 356-371


373-403: Type‑specific feature name accessors are clean; consider version gating for text/embedding APIs

The four type‑specific getters:

  • get_float_feature_names
  • get_cat_feature_names
  • get_text_feature_names
  • get_embedding_feature_names

nicely reuse get_specific_feature_names and the corresponding sys::Get*FeatureIndices functions, keeping the surface API small and consistent.

Two non‑blocking concerns to keep in mind:

  1. Version/feature availability

    • GetTextFeatureIndices and GetEmbeddingFeatureIndices may not exist in older CatBoost binaries, while this Rust code calls them unconditionally (unlike the get_*_features_count methods, which are behind #[cfg(...)] flags).
    • If you intend to support multiple CatBoost versions, you may want to mirror the #[cfg(catboost_text_count)] / #[cfg(catboost_embeddings)] gating here, or clearly document that this crate now targets a minimum CatBoost version where these functions are guaranteed to exist.
  2. Failure mode if indices retrieval fails

    • Right now, any failure in the underlying Get*FeatureIndices or GetModelUsedFeaturesNames will bubble up via CatBoostError, which is appropriate, but it might be worth mentioning in the Rustdoc that these functions can error if the underlying C API doesn’t support them or returns an error.

Overall, the API design is good; I’d just align the version gating/documentation with how you already handle text/embedding feature counts.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5693195 and 5d9e87a.

📒 Files selected for processing (2)
  • Cargo.toml (1 hunks)
  • src/model.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/model.rs (1)
src/error.rs (1)
  • check_return_value (15-21)
🔇 Additional comments (2)
Cargo.toml (1)

13-15: libc dependency addition looks appropriate; just confirm version and usage scope

Adding libc as a runtime dependency to get access to free is a standard pattern for FFI with a C library and fits the new helpers in model.rs. Please just double‑check that:

  • this exact 0.2.177 version aligns with your intended MSRV and other deps, and
  • you don’t need to gate it behind any feature flags (e.g., if you later introduce a no-ffi or similar mode).
src/model.rs (1)

339-354: get_feature_names FFI pattern is sound; ensure CatBoost version compatibility

The implementation of get_feature_names:

  • correctly initializes names_ptr/count,
  • uses GetModelUsedFeaturesNames once, checks the boolean return via CatBoostError::check_return_value, and
  • delegates pointer conversion and freeing to get_feature_names_from_c.

This is a solid, minimal wrapper.

One thing to watch: GetModelUsedFeaturesNames is only available in newer versions of the CatBoost evaluation library. If your build script still supports older CatBoost binaries, this symbol (and possibly the corresponding header declaration) might be missing, causing compilation or link failures.

It would be good to:

  • Explicitly state in the crate docs or README that these APIs require a sufficiently new CatBoost evaluation library, or
  • Gate them behind an appropriate Cargo feature / #[cfg(...)] tied to the CatBoost version you download in build.rs.

Comment on lines 278 to 297
fn get_feature_indices_from_c(
indices_ptr: *mut usize,
count: usize,
err_msg: &str,
) -> CatBoostResult<Vec<usize>> {
if indices_ptr.is_null() {
if count == 0 {
return Ok(Vec::new());
}
return Err(CatBoostError {
description: err_msg.to_owned(),
});
}
let mut indices = Vec::with_capacity(count);
for i in 0..count {
indices.push(unsafe { *indices_ptr.add(i) });
}
unsafe { libc::free(indices_ptr as *mut _) };
Ok(indices)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Add documentation confirming the C API memory ownership contract

The get_feature_indices_from_c and get_feature_names_from_c functions both use libc::free() to deallocate pointers returned from the CatBoost C API, which requires verifying and documenting the memory ownership contract:

  • Confirm in CatBoost C API documentation or comments that these functions allocate with malloc-compatible allocators and expect callers to deallocate with free
  • Add a SAFETY comment in each helper function explaining this assumption
  • Consider centralizing the pattern in a single unsafe wrapper (e.g., from_c_allocated_buffer<T>) to reduce duplication and make the contract explicit

Without documented verification of the allocator compatibility, this represents a potential source of undefined behavior (heap corruption/use-after-free), particularly if CatBoost or its dependencies use custom allocators or library-specific deallocators.

🤖 Prompt for AI Agents
In src/model.rs around lines 278 to 297, both get_feature_indices_from_c and
get_feature_names_from_c call libc::free on pointers returned by the CatBoost C
API without documenting the allocator contract; update the code by (1) adding a
SAFETY comment above each helper stating that the CatBoost C API must return
malloc-compatible buffers that the caller must free with libc::free (cite the
exact CatBoost C API doc or source line that confirms this), (2) centralize the
pattern by implementing a single unsafe helper (e.g.,
from_c_allocated_buffer<T>) that reads count elements, frees the pointer with
libc::free, and returns a Vec<T>, and (3) replace the existing functions to call
that helper so the ownership contract and unsafe boundary are explicit and
documented.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
src/model.rs (4)

278-303: Excellent safety documentation—past concerns addressed.

The comprehensive safety documentation and centralized memory management pattern directly addresses the previous review comment about documenting the C API memory ownership contract. The implementation correctly copies data before freeing and handles null pointers appropriately.

Optional: Consider using std::slice::from_raw_parts for a more idiomatic conversion:

unsafe fn from_c_allocated_buffer<T: Copy>(ptr: *mut T, count: usize) -> Vec<T> {
    if ptr.is_null() {
        return Vec::new();
    }
    let result = unsafe { std::slice::from_raw_parts(ptr, count).to_vec() };
    unsafe { libc::free(ptr as *mut _) };
    result
}

356-380: Consider adding feature gate to prevent dead code.

This helper function is only called by feature-gated public APIs and itself calls get_feature_names() which is also feature-gated. Adding the same gate here would prevent dead code warnings when the feature is disabled.

Apply this diff:

     /// Get names of specific type of features used in model,
     /// returns error if index out of bounds
+    #[cfg(catboost_feature_indices)]
     fn get_specific_feature_names(

382-415: Public API looks good; private helper could also use feature gate.

get_feature_names is properly implemented with appropriate error handling and memory management. The get_feature_indices helper has the same dead code consideration as get_specific_feature_names—it's only used by feature-gated code but isn't itself gated.

Consider adding #[cfg(catboost_feature_indices)] to get_feature_indices (line 400) as well to maintain consistency with get_specific_feature_names.


417-451: LGTM! Consistent public API implementation.

All four feature name getter functions follow a consistent pattern with proper feature gating and error handling. The implementation correctly delegates to get_specific_feature_names with the appropriate C function for each feature type.

Optional: The doc comments could be enhanced with information about return values and potential failure cases. For example:

/// Get names of float features used in model
///
/// # Returns
/// A vector of feature names, or an error if:
/// - The C API call fails
/// - Feature indices are out of bounds
///
/// # Example
/// ```ignore
/// let float_names = model.get_float_feature_names()?;
/// ```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 68ce50e and 2831014.

📒 Files selected for processing (2)
  • Cargo.toml (1 hunks)
  • src/model.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/model.rs (1)
src/error.rs (1)
  • check_return_value (15-21)
🔇 Additional comments (2)
src/model.rs (1)

305-354: LGTM! Proper C memory management for both simple and nested structures.

Both functions correctly handle the C memory lifecycle:

  • get_feature_indices_from_c uses the centralized helper for flat arrays
  • get_feature_names_from_c properly handles the 2D structure by freeing each inner string pointer before freeing the outer array
  • Null pointer and empty cases are handled correctly
Cargo.toml (1)

13-14: The libc version 0.2.178 is valid and free from known vulnerabilities.

Version 0.2.178 is the latest stable release on crates.io with no security advisories in the RustSec database. The dependency is appropriate for FFI memory management operations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant

Comments