Add get_feature_names by marcomq · Pull Request #4 · aryehlev/catboost-rust

marcomq · 2025-12-01T18:17:38Z

Adding Rust functions to get feature names:
- get_feature_names
- get_float_feature_names
- get_cat_feature_names
- get_text_feature_names
- get_embedding_feature_names

I have already opened a PR in the original catboost repo, but I also found this code here and am not sure yet which crate I should use.

Summary by CodeRabbit

New Features
- Added methods to retrieve feature names by type: float, categorical, text, and embedding.
- Added a method to retrieve all feature names from a model.
Bug Fixes
- Improved stability when retrieving feature name lists; no changes to prediction behavior.


coderabbitai · 2025-12-01T18:17:46Z

Walkthrough

Added libc runtime dependency and multiple FFI helpers plus five public APIs in src/model.rs to read feature names from the C model; the helper/public-api block was accidentally inserted twice in the file.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Dependency Management: `Cargo.toml` | Added runtime dependency `libc = "0.2.178"`. |
| Feature Name Retrieval APIs: `src/model.rs` | Added public methods: `get_float_feature_names()`, `get_cat_feature_names()`, `get_text_feature_names()`, `get_embedding_feature_names()`, and `get_feature_names()` to expose feature-name retrieval from the native model. |
| Internal Conversion & FFI Helpers: `src/model.rs` | Added unsafe FFI helpers to convert C-allocated buffers/strings into Rust `Vec`/`String` and free C memory: `from_c_allocated_buffer<T>`, `get_feature_indices_from_c()`, `get_feature_names_from_c()`, `get_specific_feature_names()`, `get_feature_indices()`, and a driver `get_feature_names()`. |
| Duplication Notice: `src/model.rs` | The helper and public-API block is duplicated within the file, producing repeated declarations that should be reconciled. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Focus review on unsafe FFI logic, pointer/null checks, and ownership transfer between C and Rust.
Verify correct freeing order for inner C strings and outer arrays to avoid leaks or double-free.
Remove or reconcile duplicated blocks to prevent duplicate-symbols/duplicate-definition compilation or API export issues.
Confirm feature-flag gating (catboost_feature_indices) is applied consistently around new public APIs.

Possibly related PRs

support all catboost versions. #2 — Adds/emits the catboost_feature_indices feature-flag infrastructure that the new APIs are gated by; directly related to build/feature gating.

Poem

🐰 I hopped through pointers by moonlit streams,
I freed C crumbs and harvested names in dreams,
Five lists I gathered, tidy and bright,
Libc lent a hand through the quiet night,
Hooray — the model's labels now in sight! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add get_feature_names' directly summarizes the main change: adding feature name retrieval APIs to the codebase. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

src/model.rs (2)

325-337: Guard assumptions when mapping feature indices to names

The pattern of:

calling get_feature_names() to get all_names, then

calling get_feature_indices(...) and indexing all_names[i]

is clean, but it rests entirely on the C API guaranteeing that:

every index returned by indices_fn is < all_names.len(), and

the index space used by Get*FeatureIndices matches that of GetModelUsedFeaturesNames.

If that contract is ever violated (version skew, bug, or undocumented change), all_names[i] will either panic on an out-of-bounds index or silently return the wrong name when the two index spaces differ.

Suggested follow‑ups:

Add a short debug‑only sanity check before indexing, e.g. debug_assert!(i < all_names.len());.

Document in the get_specific_feature_names doc comment that it assumes indices are in bounds for the array returned by GetModelUsedFeaturesNames.

The rest of the FFI wrapper (indices_fn signature, indices_ptr/count handling) looks consistent with the surrounding code.

Also applies to: 356-371
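A minimal sketch of the bounds-checked lookup suggested above. It assumes a plain `String` error in place of the crate's `CatBoostError`, and `names_for_indices` is a hypothetical helper name that does not appear in the PR:

```rust
/// Map feature indices to names, returning an error instead of panicking
/// when an index falls outside `all_names`.
/// (Hypothetical helper; the error type is a stand-in for `CatBoostError`.)
fn names_for_indices(all_names: &[String], indices: &[usize]) -> Result<Vec<String>, String> {
    indices
        .iter()
        .map(|&i| {
            // `get` returns None for an out-of-bounds index, so no panic can occur here.
            all_names.get(i).cloned().ok_or_else(|| {
                format!("feature index {i} out of bounds for {} names", all_names.len())
            })
        })
        // Collecting into Result stops at the first failing index.
        .collect()
}
```

Using `get` turns a contract violation in the C API into a recoverable error, which may be preferable to a `debug_assert!` that disappears in release builds.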

373-403: Type‑specific feature name accessors are clean; consider version gating for text/embedding APIs

The four type‑specific getters:

get_float_feature_names

get_cat_feature_names

get_text_feature_names

get_embedding_feature_names

nicely reuse get_specific_feature_names and the corresponding sys::Get*FeatureIndices functions, keeping the surface API small and consistent.

Two non‑blocking concerns to keep in mind:

Version/feature availability

GetTextFeatureIndices and GetEmbeddingFeatureIndices may not exist in older CatBoost binaries, while this Rust code calls them unconditionally (unlike the get_*_features_count methods, which are behind #[cfg(...)] flags).

If you intend to support multiple CatBoost versions, you may want to mirror the #[cfg(catboost_text_count)] / #[cfg(catboost_embeddings)] gating here, or clearly document that this crate now targets a minimum CatBoost version where these functions are guaranteed to exist.

Failure mode if indices retrieval fails

Right now, any failure in the underlying Get*FeatureIndices or GetModelUsedFeaturesNames will bubble up via CatBoostError, which is appropriate, but it might be worth mentioning in the Rustdoc that these functions can error if the underlying C API doesn’t support them or returns an error.

Overall, the API design is good; I’d just align the version gating/documentation with how you already handle text/embedding feature counts.
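One way the version gating could look, sketched under assumptions: `Model` is a stand-in struct, the error type is a plain `String`, and the non-gated fallback is an illustrative design choice rather than something the PR contains. A custom cfg such as `catboost_text_count` is typically emitted from a build script with `println!("cargo:rustc-cfg=catboost_text_count");`.

```rust
/// Stand-in for the crate's model type (hypothetical, for illustration only).
struct Model;

impl Model {
    /// Compiled only when the build passes `--cfg catboost_text_count`.
    #[cfg(catboost_text_count)]
    fn get_text_feature_names(&self) -> Result<Vec<String>, String> {
        // The real method would call into the CatBoost C API here.
        Ok(Vec::new())
    }

    /// Fallback for builds against a library without text-feature support.
    /// (An assumption of this sketch; the gated method could also simply be absent.)
    #[cfg(not(catboost_text_count))]
    fn get_text_feature_names(&self) -> Result<Vec<String>, String> {
        Err("text feature names require a newer CatBoost library".to_owned())
    }
}
```

With the fallback, callers compile against either library version and see a runtime error on older ones; dropping the fallback instead gives a compile-time failure, which matches how the existing count methods are described as being gated.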

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5693195 and 5d9e87a.

📒 Files selected for processing (2)

Cargo.toml (1 hunks)
src/model.rs (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

src/model.rs (1)

src/error.rs (1)

check_return_value (15-21)

🔇 Additional comments (2)

Cargo.toml (1)

13-15: libc dependency addition looks appropriate; just confirm version and usage scope

Adding libc as a runtime dependency to get access to free is a standard pattern for FFI with a C library and fits the new helpers in model.rs. Please just double‑check that:

this exact 0.2.177 version aligns with your intended MSRV and other deps, and

you don’t need to gate it behind any feature flags (e.g., if you later introduce a no-ffi or similar mode).

src/model.rs (1)

339-354: get_feature_names FFI pattern is sound; ensure CatBoost version compatibility

The implementation of get_feature_names:

correctly initializes names_ptr/count,

uses GetModelUsedFeaturesNames once, checks the boolean return via CatBoostError::check_return_value, and

delegates pointer conversion and freeing to get_feature_names_from_c.

This is a solid, minimal wrapper.

One thing to watch: GetModelUsedFeaturesNames is only available in newer versions of the CatBoost evaluation library. If your build script still supports older CatBoost binaries, this symbol (and possibly the corresponding header declaration) might be missing, causing compilation or link failures.

It would be good to:

Explicitly state in the crate docs or README that these APIs require a sufficiently new CatBoost evaluation library, or

Gate them behind an appropriate Cargo feature / #[cfg(...)] tied to the CatBoost version you download in build.rs.

coderabbitai · 2025-12-01T18:23:14Z

src/model.rs

+    fn get_feature_indices_from_c(
+        indices_ptr: *mut usize,
+        count: usize,
+        err_msg: &str,
+    ) -> CatBoostResult<Vec<usize>> {
+        if indices_ptr.is_null() {
+            if count == 0 {
+                return Ok(Vec::new());
+            }
+            return Err(CatBoostError {
+                description: err_msg.to_owned(),
+            });
+        }
+        let mut indices = Vec::with_capacity(count);
+        for i in 0..count {
+            indices.push(unsafe { *indices_ptr.add(i) });
+        }
+        unsafe { libc::free(indices_ptr as *mut _) };
+        Ok(indices)
+    }


⚠️ Potential issue | 🟠 Major

Add documentation confirming the C API memory ownership contract

The get_feature_indices_from_c and get_feature_names_from_c functions both use libc::free() to deallocate pointers returned from the CatBoost C API, which requires verifying and documenting the memory ownership contract:

Confirm in CatBoost C API documentation or comments that these functions allocate with malloc-compatible allocators and expect callers to deallocate with free

Add a SAFETY comment in each helper function explaining this assumption

Consider centralizing the pattern in a single unsafe wrapper (e.g., from_c_allocated_buffer<T>) to reduce duplication and make the contract explicit

Without documented verification of the allocator compatibility, this represents a potential source of undefined behavior (heap corruption/use-after-free), particularly if CatBoost or its dependencies use custom allocators or library-specific deallocators.
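The points above can be sketched as one helper that states its contract in a `# Safety` section and keeps the null-pointer handling from the diff. This is an illustration only: `malloc`/`free` are declared directly so the block needs no external crate, the error type is a plain `String`, `demo_indices` exists purely to exercise the helper, and allocator compatibility with CatBoost is assumed rather than verified.

```rust
use std::ffi::c_void;

// C allocator entry points, declared here instead of pulling in the libc crate.
unsafe extern "C" {
    fn malloc(size: usize) -> *mut c_void;
    fn free(ptr: *mut c_void);
}

/// Copy `count` elements out of a C-allocated buffer, then free the buffer.
///
/// # Safety
/// - `ptr` must be null or point to `count` initialized values of `T`.
/// - The buffer must come from a `malloc`-compatible allocator
///   (assumed in this sketch, not checked against the CatBoost sources).
/// - `ptr` must not be used again after this call.
unsafe fn from_c_allocated_buffer<T: Copy>(ptr: *mut T, count: usize) -> Result<Vec<T>, String> {
    if ptr.is_null() {
        // A null pointer is only acceptable for an empty result.
        return if count == 0 {
            Ok(Vec::new())
        } else {
            Err(format!("null buffer with non-zero count {count}"))
        };
    }
    // SAFETY: the caller guarantees `ptr` is valid for `count` reads of `T`.
    let values = unsafe { std::slice::from_raw_parts(ptr, count) }.to_vec();
    // SAFETY: the caller guarantees we own the buffer and it is malloc-allocated.
    unsafe { free(ptr.cast::<c_void>()) };
    Ok(values)
}

/// Demo only: build a malloc-backed buffer the way a C library might.
fn demo_indices() -> Result<Vec<usize>, String> {
    let count = 3;
    unsafe {
        let ptr = malloc(count * std::mem::size_of::<usize>()).cast::<usize>();
        assert!(!ptr.is_null(), "malloc failed");
        for i in 0..count {
            ptr.add(i).write(i * 10);
        }
        from_c_allocated_buffer(ptr, count)
    }
}
```

Copying before freeing means the returned `Vec` never aliases C-owned memory, so the unsafe boundary stays inside this one function.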

🤖 Prompt for AI Agents

In src/model.rs around lines 278 to 297, both get_feature_indices_from_c and get_feature_names_from_c call libc::free on pointers returned by the CatBoost C API without documenting the allocator contract; update the code by (1) adding a SAFETY comment above each helper stating that the CatBoost C API must return malloc-compatible buffers that the caller must free with libc::free (cite the exact CatBoost C API doc or source line that confirms this), (2) centralize the pattern by implementing a single unsafe helper (e.g., from_c_allocated_buffer<T>) that reads count elements, frees the pointer with libc::free, and returns a Vec<T>, and (3) replace the existing functions to call that helper so the ownership contract and unsafe boundary are explicit and documented.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (4)

src/model.rs (4)

278-303: Excellent safety documentation; past concerns addressed.

The comprehensive safety documentation and centralized memory management pattern directly addresses the previous review comment about documenting the C API memory ownership contract. The implementation correctly copies data before freeing and handles null pointers appropriately.

Optional: Consider using std::slice::from_raw_parts for a more idiomatic conversion:
unsafe fn from_c_allocated_buffer<T: Copy>(ptr: *mut T, count: usize) -> Vec<T> {
    if ptr.is_null() {
        return Vec::new();
    }
    let result = unsafe { std::slice::from_raw_parts(ptr, count).to_vec() };
    unsafe { libc::free(ptr as *mut _) };
    result
}

356-380: Consider adding a feature gate to prevent dead code.

This helper function is only called by feature-gated public APIs and itself calls get_feature_names() which is also feature-gated. Adding the same gate here would prevent dead code warnings when the feature is disabled.

Apply this diff:
     /// Get names of specific type of features used in model,
     /// returns error if index out of bounds
+    #[cfg(catboost_feature_indices)]
     fn get_specific_feature_names(

382-415: Public API looks good; private helper could also use a feature gate.

get_feature_names is properly implemented with appropriate error handling and memory management. The get_feature_indices helper has the same dead code consideration as get_specific_feature_names—it's only used by feature-gated code but isn't itself gated.

Consider adding #[cfg(catboost_feature_indices)] to get_feature_indices (line 400) as well to maintain consistency with get_specific_feature_names.

417-451: LGTM! Consistent public API implementation.

All four feature name getter functions follow a consistent pattern with proper feature gating and error handling. The implementation correctly delegates to get_specific_feature_names with the appropriate C function for each feature type.

Optional: The doc comments could be enhanced with information about return values and potential failure cases. For example:
/// Get names of float features used in model
///
/// # Returns
/// A vector of feature names, or an error if:
/// - The C API call fails
/// - Feature indices are out of bounds
///
/// # Example
/// ```ignore
/// let float_names = model.get_float_feature_names()?;
/// ```

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 68ce50e and 2831014.

📒 Files selected for processing (2)

Cargo.toml (1 hunks)
src/model.rs (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

src/model.rs (1)

src/error.rs (1)

check_return_value (15-21)

🔇 Additional comments (2)

src/model.rs (1)

305-354: LGTM! Proper C memory management for both simple and nested structures.

Both functions correctly handle the C memory lifecycle:

get_feature_indices_from_c uses the centralized helper for flat arrays

get_feature_names_from_c properly handles the 2D structure by freeing each inner string pointer before freeing the outer array

Null pointer and empty cases are handled correctly
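The freeing order described above (each inner string first, then the outer array) can be sketched as follows. The names `names_from_c` and `demo_names` are hypothetical, `malloc`/`free` are declared directly to keep the block dependency-free, and the assumption that every pointer is malloc-allocated is carried over from the review rather than verified here.

```rust
use std::ffi::{c_char, c_void, CStr};

// C allocator entry points, declared here instead of pulling in the libc crate.
unsafe extern "C" {
    fn malloc(size: usize) -> *mut c_void;
    fn free(ptr: *mut c_void);
}

/// Convert a C array of NUL-terminated strings into owned Rust strings,
/// freeing every inner string and then the outer array.
///
/// # Safety
/// `names_ptr` must be null or point to `count` pointers, each null or a
/// valid malloc-allocated C string; nothing may be used after this call.
unsafe fn names_from_c(names_ptr: *mut *mut c_char, count: usize) -> Vec<String> {
    if names_ptr.is_null() {
        return Vec::new();
    }
    let mut names = Vec::with_capacity(count);
    for i in 0..count {
        // SAFETY: `names_ptr` is valid for `count` reads per the contract.
        let s = unsafe { *names_ptr.add(i) };
        if s.is_null() {
            // Keep positions aligned with feature indices.
            names.push(String::new());
            continue;
        }
        // Copy the bytes into an owned String before releasing the C memory.
        names.push(unsafe { CStr::from_ptr(s) }.to_string_lossy().into_owned());
        unsafe { free(s.cast::<c_void>()) };
    }
    // The outer array is freed last, after all inner pointers were read.
    unsafe { free(names_ptr.cast::<c_void>()) };
    names
}

/// Demo only: allocate a C-style string array with malloc.
fn demo_names() -> Vec<String> {
    let src = ["age", "city"];
    unsafe {
        let outer = malloc(src.len() * std::mem::size_of::<*mut c_char>()).cast::<*mut c_char>();
        assert!(!outer.is_null(), "malloc failed");
        for (i, s) in src.iter().enumerate() {
            let inner = malloc(s.len() + 1).cast::<u8>();
            assert!(!inner.is_null(), "malloc failed");
            std::ptr::copy_nonoverlapping(s.as_ptr(), inner, s.len());
            inner.add(s.len()).write(0); // NUL terminator
            outer.add(i).write(inner.cast::<c_char>());
        }
        names_from_c(outer, src.len())
    }
}
```

Freeing the outer array first would leave the inner pointers unreachable and leak them, which is why the loop runs before the final `free`.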

Cargo.toml (1)

13-14: The libc version 0.2.178 is valid and free from known vulnerabilities.

Version 0.2.178 is the latest stable release on crates.io with no security advisories in the RustSec database. The dependency is appropriate for FFI memory management operations.

add get feature names

26b0a74

fmt

5d9e87a

coderabbitai bot reviewed Dec 1, 2025


marcomq added 2 commits December 2, 2025 18:17

error handling and specific cfg

68ce50e

add safety comments

2831014

coderabbitai bot reviewed Dec 15, 2025


marcomq wants to merge 4 commits into aryehlev:publish from marcomq:publish
