Skip to content

decoder: New method simdjson_decode_from_stream#30

Merged
JakubOnderka merged 5 commits intomasterfrom
version5
Jan 7, 2025
Merged

decoder: New method simdjson_decode_from_stream#30
JakubOnderka merged 5 commits intomasterfrom
version5

Conversation

@JakubOnderka
Copy link
Copy Markdown
Owner

@JakubOnderka JakubOnderka commented Jan 7, 2025

Summary by CodeRabbit

Release Notes

  • New Features

    • Added ability to decode JSON directly from file streams.
    • Expanded JSON parsing capabilities to support buffer-based decoding.
  • Bug Fixes

    • Improved exception handling by introducing specific decoder and encoder exception classes.
    • Updated exception handling in tests to reflect new exception types.
  • Performance Improvements

    • Optimized JSON string processing with more direct memory handling.
    • Enhanced stream reading and memory allocation techniques.
  • Compatibility

    • Maintained existing JSON decoding functionality while introducing new parsing methods.

These updates provide more flexible JSON handling in the simdjson PHP extension, allowing developers to work with JSON data from various sources more efficiently.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Jan 7, 2025

Walkthrough

This pull request enhances the simdjson PHP extension by introducing new functionality for JSON decoding from streams and modifying existing parsing mechanisms. The changes primarily focus on improving how JSON data is processed, with updates to function signatures, memory management, and stream handling. A new function simdjson_decode_from_stream is added, allowing JSON decoding directly from file system pointer resources, and several internal functions are modified to support more flexible JSON parsing.

Changes

File Change Summary
php_simdjson.cpp - Modified simdjson_simple_decode function signature to accept const char * and size_t instead of zend_string *
- Adjusted internal logic for JSON processing
- Added new simdjson_stream_copy_to_mem function for stream reading
- Introduced simdjson_decode_from_stream PHP function
php_simdjson.h - Added php_simdjson_parse_buffer function declaration for parsing JSON from a buffer
- Updated exception class declarations
simdjson.stub.php - Added simdjson_decode_from_stream function
- Updated documentation for the new function
- Introduced new exception classes
simdjson_arginfo.h - Added argument information for simdjson_decode_from_stream function
src/simdjson_decoder.cpp - Implemented php_simdjson_parse_buffer function
- Minor indentation adjustment in php_simdjson_error_msg
- Updated exception handling to use new decoder exception
tests/decode_stream.phpt - Introduced a new test file to validate simdjson_decode_from_stream() functionality
tests/compat/*.phpt - Updated exception handling in multiple test files to reflect changes from SimdJsonException to SimdJsonDecoderException

Possibly related PRs

  • decoder: When parsing bigger JSON than 1 MB, use separate parser that… #23: The changes in php_simdjson.cpp regarding the management of the JSON parser's lifecycle and memory usage are related to the main PR's updates to exception handling and function signatures in the same file, particularly concerning the decoding process.
  • Version5 #27: Modifications in php_simdjson.cpp that involve changes to return types and handling of string variables are relevant as they may affect the functionality of the simdjson_simple_decode function, which is also updated in the main PR.
  • test: Check possible crash when decoding/encoding different JSON files #29: The introduction of tests for simdjson_encode() and simdjson_decode() functions aligns with the main PR's enhancements to these functions, particularly in terms of error handling and functionality improvements.

Poem

🐰 A Rabbit's Ode to JSON Streams

With pointers swift and buffers bright,
We parse the JSON, byte by byte,
From streams that flow with data's might,
Our simdjson dances left and right,
Decoding magic, pure delight! 🌟


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ced3dc and 7da05ba.

📒 Files selected for processing (2)
  • simdjson.stub.php (5 hunks)
  • simdjson_arginfo.h (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • simdjson_arginfo.h
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: Macos (8.3)
  • GitHub Check: Macos (8.2)
  • GitHub Check: Macos (8.1)
  • GitHub Check: Macos (8.0)
  • GitHub Check: Ubuntu-dev
  • GitHub Check: Ubuntu-dev
🔇 Additional comments (4)
simdjson.stub.php (4)

247-247: LGTM! Exception type update improves error handling specificity.

The change from SimdJsonException to SimdJsonDecoderException provides better error handling granularity.


Line range hint 277-349: LGTM! Consistent exception hierarchy updates.

The changes consistently update all functions to use the new, more specific exception types (SimdJsonDecoderException and SimdJsonEncoderException), improving error handling across the API.


353-357: LGTM! Well-designed exception hierarchy.

The new exception hierarchy follows SOLID principles:

  • Abstract base class prevents direct instantiation
  • Specific decoder/encoder exceptions enable precise error handling

253-266: 🛠️ Refactor suggestion

Update docblock for accuracy and type safety.

The docblock has several inconsistencies with the actual implementation:

  1. Description mentions "JSON encoded string" but the function accepts a stream resource
  2. Error message refers to "$json over 4GB" but should refer to stream content
  3. Resource parameter type should be more specific

Apply these changes:

 /**
- * Takes a JSON encoded string and converts it into a PHP variable.
+ * Reads and decodes JSON data from a stream into a PHP variable.
  * Similar to json_decode()
  *
- * @param resource $res A file system pointer resource that is typically created using fopen().
+ * @param resource<stream> $res A file system pointer resource that is typically created using fopen().
  * @param bool $associative When true, JSON objects will be returned as associative arrays.
  *                          When false, JSON objects will be returned as objects.
  * @param int $depth the maximum nesting depth of the structure being decoded.
  * @return array|stdClass|string|float|int|bool|null
  * @throws SimdJsonDecoderException for invalid JSON
- *                           (or $json over 4GB long, or out of range integer/float)
+ *                           (or stream content over 4GB long, or out of range integer/float)
  * @throws ValueError for invalid $depth
  */

Likely invalid or redundant comment.


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
php_simdjson.h (1)

169-176: Declare new php_simdjson_parse_buffer function

A new function php_simdjson_parse_buffer has been declared (lines 169-176) to parse JSON data from a raw buffer. Please consider:

  1. Documentation Clarity

    • Enhance the documentation comment to clearly explain the padding requirement and any assumptions about the input buffer (lines 171-174).
  2. Consistent Error Codes

    • Ensure that the function returns error codes consistent with other parsing functions for seamless integration.
  3. Usage Guidelines

    • Provide guidance on how and when to use this function, especially regarding the necessity of the simdjson::SIMDJSON_PADDING.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e3efa69 and 4388267.

📒 Files selected for processing (5)
  • php_simdjson.cpp (4 hunks)
  • php_simdjson.h (1 hunks)
  • simdjson.stub.php (1 hunks)
  • simdjson_arginfo.h (4 hunks)
  • src/simdjson_decoder.cpp (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: Ubuntu-dev
  • GitHub Check: Macos (8.4)
  • GitHub Check: Macos (8.3)
  • GitHub Check: Macos (8.2)
  • GitHub Check: Macos (8.1)
  • GitHub Check: Macos (8.0)
  • GitHub Check: Ubuntu-dev
🔇 Additional comments (8)
php_simdjson.cpp (5)

113-117: Update simdjson_simple_decode function signature

The function simdjson_simple_decode now accepts const char *json and size_t len instead of zend_string *json (lines 113-117). This change improves performance by avoiding the overhead of zend_string. Ensure that all calls to this function are updated accordingly to match the new signature.


130-133: Verify memcmp usage for boolean values

The checks for the boolean values "true" and "false" have been updated to use memcmp (lines 130-133). Verify that this change correctly handles these values, including proper length checks and case sensitivity, to prevent any unexpected behavior.


175-220: Implement simdjson_stream_copy_to_mem function

A new function simdjson_stream_copy_to_mem has been added (lines 175-220) to read JSON data from a PHP stream into memory with proper padding. Consider the following:

  1. Memory Allocation and Padding

    • Ensure that memory allocations account for the required padding (simdjson::SIMDJSON_PADDING) as per simdjson's requirements (lines 215-217).
  2. Error Handling

    • Check that the function gracefully handles errors, such as read failures or empty streams, and frees allocated memory appropriately (lines 212-215).
  3. Stream Position Handling

    • Verify that the function correctly handles the stream's position, especially if the stream is non-blocking or if it reaches EOF.

222-269: Add simdjson_decode_from_stream function

The new PHP function simdjson_decode_from_stream allows decoding JSON data directly from a stream resource (lines 222-269). Please review the following:

  1. Parameter Validation

    • Ensure that the provided resource is a valid stream before proceeding (lines 223-238). Consider adding checks to handle invalid resources.
  2. Depth Validation

    • The depth parameter validation uses argument position 3 (line 240). Since depth is the third optional argument, confirm that this aligns with the function's parameter order.
  3. Memory Management

    • After decoding, the allocated memory for json is freed (line 263). Ensure that there are no code paths where json might not be freed, which could lead to memory leaks.
  4. Error Handling Consistency

    • Verify that errors are handled consistently with other decode functions, including throwing exceptions when parsing fails (lines 265-268).

249-252: Optimize simple JSON decoding

The call to simdjson_simple_decode for simple JSON values (lines 249-252) enhances performance for small inputs. Ensure that this optimization correctly handles all simple JSON cases when reading from streams and does not introduce any discrepancies compared to string-based decoding.

simdjson_arginfo.h (1)

17-22: Add simdjson_decode_from_stream to function entries

The simdjson_decode_from_stream function has been added to the argument info and function entries (lines 17-22, 76-77, 92-94). Please verify the following:

  1. Argument Definitions

    • Confirm that the arguments are correctly defined, including types and default values (lines 18-21). The resource parameter res should be properly typed.
  2. Stub Hash Update

    • The stub hash has changed (line 2). Ensure that this update is consistent with the changes in the stub file and that the hash is correctly generated.
  3. Function Registration

    • Verify that the function is registered appropriately in ext_functions (lines 92-94) and that it will be available to PHP scripts.

To ensure consistency, run the following script to check the function declarations and registrations:

Also applies to: 76-77, 92-94

✅ Verification successful

Function implementation and registration verified

The simdjson_decode_from_stream function is correctly implemented with:

  • Proper resource parameter handling and type checking
  • Matching argument definitions and implementation
  • Complete function registration in the extension
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that 'simdjson_decode_from_stream' is correctly declared and registered.

# Check for function declaration in header files
rg '^PHP_FUNCTION\(simdjson_decode_from_stream\)' --type-add cppheader:*.h --type cppheader

# Check for argument info definition
rg '^ZEND_BEGIN_ARG_WITH_RETURN_TYPE_INFO_EX\(arginfo_simdjson_decode_from_stream' simdjson_arginfo.h

# Confirm function entry in ext_functions
rg 'ZEND_FE\(simdjson_decode_from_stream' simdjson_arginfo.h

Length of output: 433


Script:

#!/bin/bash
# Check for function implementation in cpp files
rg '^PHP_FUNCTION\(simdjson_decode_from_stream\)' --type cpp

# Look for resource type definitions and handling
rg 'le_simdjson|zend_register_resource' --type cpp -A 2

# Check for stream handling patterns
ast-grep --pattern 'PHP_FUNCTION(simdjson_decode_from_stream) {
  $$$
}'

Length of output: 256


Script:

#!/bin/bash
# Get the implementation details with context
rg -A 20 '^PHP_FUNCTION\(simdjson_decode_from_stream\)' php_simdjson.cpp

# Look for stream-related type definitions
rg 'php_stream|zend_resource' php_simdjson.cpp

# Check for any resource cleanup or destruction
rg 'ZEND_(FETCH|VERIFY)_RESOURCE' php_simdjson.cpp

Length of output: 1225

simdjson.stub.php (1)

253-266: LGTM! Well-documented function signature.

The new function simdjson_decode_from_stream is properly defined with comprehensive documentation, including:

  • Clear parameter descriptions
  • Return type specification
  • Exception handling documentation
src/simdjson_decoder.cpp (1)

553-558: LGTM! Efficient implementation with proper error handling.

The implementation is well-structured and includes performance optimization by avoiding unnecessary reallocation with realloc_if_needed=false. The function properly reuses existing parsing logic and error handling mechanisms.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
simdjson.stub.php (1)

253-266: LGTM! Consider adding examples to the documentation.

The function signature and documentation are well-structured. To enhance usability, consider adding usage examples in the documentation, showing how to use the function with different types of streams (e.g., file streams, memory streams).

Example addition to the documentation:

 * @throws ValueError for invalid $depth
+ * @example
+ * // Example with file stream
+ * $fp = fopen('data.json', 'r');
+ * $data = simdjson_decode_from_stream($fp);
+ * fclose($fp);
php_simdjson.h (1)

169-176: LGTM! Consider documenting the padding size requirement.

The function declaration is well-structured and properly exposed via PHP_SIMDJSON_API. To improve clarity, consider documenting the specific padding size requirement.

Add padding size information to the documentation:

 * Parses the given string into a return code. You have to be sure that the provided JSON contains simdjson::SIMDJSON_PADDING
+ * The required padding size is defined by simdjson::SIMDJSON_PADDING (currently 32 bytes).
php_simdjson.cpp (1)

175-220: Consider adding overflow checks for buffer size calculations.

The stream reading implementation is well-structured, but buffer size calculations should be checked for potential overflow.

-		buflen = ZEND_MM_ALIGNED_SIZE(MAX(ssbuf.sb.st_size - src->position, 0)) + step;
+		size_t remaining = MAX(ssbuf.sb.st_size - src->position, 0);
+		if (UNEXPECTED(remaining > SIZE_MAX - step)) {
+			php_error_docref(NULL, E_WARNING, "Input is too large");
+			return NULL;
+		}
+		buflen = ZEND_MM_ALIGNED_SIZE(remaining) + step;
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4388267 and ea19bfe.

📒 Files selected for processing (5)
  • php_simdjson.cpp (4 hunks)
  • php_simdjson.h (1 hunks)
  • simdjson.stub.php (1 hunks)
  • simdjson_arginfo.h (4 hunks)
  • src/simdjson_decoder.cpp (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: Ubuntu-dev
  • GitHub Check: Macos (8.4)
  • GitHub Check: Macos (8.0)
  • GitHub Check: Macos (8.3)
  • GitHub Check: Macos (8.2)
  • GitHub Check: Ubuntu-dev
🔇 Additional comments (4)
php_simdjson.cpp (2)

Line range hint 113-139: LGTM! Efficient implementation of simple JSON value parsing.

The changes to use direct buffer access and memcmp improve efficiency for simple JSON values.


222-269: LGTM! Well-structured stream decoding implementation.

The implementation includes proper resource validation, error handling, and memory management. All error paths properly clean up allocated memory.

simdjson_arginfo.h (1)

17-21: LGTM! Proper argument info and function registration.

The argument info structure and function registration are correctly implemented, matching the function signature in simdjson.stub.php.

Also applies to: 76-76, 92-92

src/simdjson_decoder.cpp (1)

553-558: LGTM! Efficient implementation of buffer-based parsing.

The new function provides an optimized path for parsing JSON directly from buffers:

  • Avoids zend_string overhead
  • Prevents unnecessary reallocations
  • Maintains consistent error handling with other parsing functions

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (3)
tests/decode_stream.phpt (1)

12-16: Enhance error handling test.

The test should verify the specific error message and code thrown by the empty memory stream.

 try {
-    simdjson_decode_from_stream(fopen("php://memory", "r"));
+    $handle = fopen("php://memory", "r");
+    simdjson_decode_from_stream($handle);
+    fclose($handle);
+    assert(false, "Expected exception was not thrown");
 } catch (Exception $e) {
-    echo $e->getMessage();
+    assert($e instanceof SimdJsonException, "Expected SimdJsonException, got " . get_class($e));
+    assert($e->getCode() === SIMDJSON_ERR_EMPTY, "Expected SIMDJSON_ERR_EMPTY error code");
+    assert($e->getMessage() === "no JSON found", "Unexpected error message: " . $e->getMessage());
 }
src/simdjson_decoder.cpp (1)

553-558: LGTM! Consider adding documentation.

The implementation is clean and follows the established patterns. It properly reuses existing parsing and conversion logic while handling errors consistently.

Consider adding a comment to document that the input buffer must contain simdjson::SIMDJSON_PADDING extra bytes of padding for the SIMD operations to work safely.

php_simdjson.cpp (1)

222-269: LGTM with a note about error handling.

The implementation is well-structured and follows the established patterns in the codebase. It properly handles resources, memory management, and parser reuse.

Consider adding more specific error messages for different stream-related failures (e.g., network errors, timeouts) instead of using the generic simdjson::EMPTY error.

Example improvement for error handling:

 if ((json = simdjson_stream_copy_to_mem(stream, &len)) == NULL) {
-    php_simdjson_throw_jsonexception(simdjson::EMPTY);
+    if (php_stream_eof(stream)) {
+        php_simdjson_throw_jsonexception(simdjson::EMPTY);
+    } else {
+        php_simdjson_throw_jsonexception(simdjson::IO_ERROR);
+    }
     RETURN_THROWS();
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ea19bfe and 613a38f.

📒 Files selected for processing (6)
  • php_simdjson.cpp (4 hunks)
  • php_simdjson.h (1 hunks)
  • simdjson.stub.php (1 hunks)
  • simdjson_arginfo.h (4 hunks)
  • src/simdjson_decoder.cpp (2 hunks)
  • tests/decode_stream.phpt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • simdjson.stub.php
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: Ubuntu-dev
  • GitHub Check: Ubuntu-dev
🔇 Additional comments (2)
simdjson_arginfo.h (1)

17-21: LGTM!

The function declaration and registration are well-structured and consistent with other decode functions in the file. The parameter types and return type are appropriate for JSON decoding from a stream.

Also applies to: 92-92

php_simdjson.cpp (1)

Line range hint 113-139: LGTM!

The function has been optimized to work directly with character pointers and lengths, using efficient memory comparison operations. The logic remains correct while improving performance by avoiding unnecessary string operations.

Comment on lines +7 to +10
foreach ($filesToCheck as $file) {
echo "Processing file $file\n";
simdjson_decode_from_stream(fopen($file, 'r'));
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add verification of decoded JSON data.

The test only checks if decoding succeeds but doesn't verify the correctness of the decoded data. Consider adding assertions to validate the structure and content of the decoded JSON.

 foreach ($filesToCheck as $file) {
     echo "Processing file $file\n";
-    simdjson_decode_from_stream(fopen($file, 'r'));
+    $handle = fopen($file, 'r');
+    $decoded = simdjson_decode_from_stream($handle);
+    fclose($handle);
+    
+    // Verify decoded data matches expected structure
+    $expected = json_decode(file_get_contents($file), true);
+    assert($decoded == $expected, "Decoded data doesn't match expected for file: $file");
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
foreach ($filesToCheck as $file) {
echo "Processing file $file\n";
simdjson_decode_from_stream(fopen($file, 'r'));
}
foreach ($filesToCheck as $file) {
echo "Processing file $file\n";
$handle = fopen($file, 'r');
$decoded = simdjson_decode_from_stream($handle);
fclose($handle);
// Verify decoded data matches expected structure
$expected = json_decode(file_get_contents($file), true);
assert($decoded == $expected, "Decoded data doesn't match expected for file: $file");
}

Comment on lines +169 to +177

/**
* Parses the given string into a return code. You have to be sure that the provided JSON contains simdjson::SIMDJSON_PADDING
*
* If the returned error code is 0, then return_value contains the parsed value.
* If the returned error code is non-0, then return_value will not be initialized.
*/
PHP_SIMDJSON_API simdjson_php_error_code php_simdjson_parse_buffer(simdjson_php_parser* parser, const char *json, size_t len, zval *return_value, bool associative, size_t depth);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Enhance documentation for buffer padding requirement.

The function requires the input buffer to have simdjson::SIMDJSON_PADDING extra bytes, but this requirement isn't documented in the parameter list. This could lead to buffer overflows if callers aren't aware of the padding requirement.

 /**
  * Parses the given string into a return code. You have to be sure that the provided JSON contains simdjson::SIMDJSON_PADDING
  *
+ * @param simdjson_php_parser* parser The parser instance to use
+ * @param const char *json The JSON string buffer with SIMDJSON_PADDING extra bytes allocated
+ * @param size_t len The length of the JSON string (excluding padding)
+ * @param zval *return_value The zval to store the parsed result
+ * @param bool associative Whether to parse objects as associative arrays
+ * @param size_t depth Maximum nesting depth
+ * @warning The json parameter must have SIMDJSON_PADDING extra bytes allocated after len
  *
  * If the returned error code is 0, then return_value contains the parsed value.
  * If the returned error code is non-0, then return_value will not be initialized.
  */
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/**
* Parses the given string into a return code. You have to be sure that the provided JSON contains simdjson::SIMDJSON_PADDING
*
* If the returned error code is 0, then return_value contains the parsed value.
* If the returned error code is non-0, then return_value will not be initialized.
*/
PHP_SIMDJSON_API simdjson_php_error_code php_simdjson_parse_buffer(simdjson_php_parser* parser, const char *json, size_t len, zval *return_value, bool associative, size_t depth);
/**
* Parses the given string into a return code. You have to be sure that the provided JSON contains simdjson::SIMDJSON_PADDING
*
* @param simdjson_php_parser* parser The parser instance to use
* @param const char *json The JSON string buffer with SIMDJSON_PADDING extra bytes allocated
* @param size_t len The length of the JSON string (excluding padding)
* @param zval *return_value The zval to store the parsed result
* @param bool associative Whether to parse objects as associative arrays
* @param size_t depth Maximum nesting depth
* @warning The json parameter must have SIMDJSON_PADDING extra bytes allocated after len
*
* If the returned error code is 0, then return_value contains the parsed value.
* If the returned error code is non-0, then return_value will not be initialized.
*/
PHP_SIMDJSON_API simdjson_php_error_code php_simdjson_parse_buffer(simdjson_php_parser* parser, const char *json, size_t len, zval *return_value, bool associative, size_t depth);

Comment on lines +175 to +228
// Simplified version of _php_stream_copy_to_mem that allways allocated string with required padding and returns
// char* instead of of zend_string* to avoid unnecessary overhead
static char *simdjson_stream_copy_to_mem(php_stream *src, size_t *len) {
ssize_t ret = 0;
char *ptr;
size_t buflen;
int step = 8192;
int min_room = 8192 / 4;
php_stream_statbuf ssbuf;
char* result;

/* avoid many reallocs by allocating a good-sized chunk to begin with, if
* we can. Note that the stream may be filtered, in which case the stat
* result may be inaccurate, as the filter may inflate or deflate the
* number of bytes that we can read. In order to avoid an upsize followed
* by a downsize of the buffer, overestimate by the step size (which is
* 8K). */
if (php_stream_stat(src, &ssbuf) == 0 && ssbuf.sb.st_size > 0) {
buflen = ZEND_MM_ALIGNED_SIZE(MAX(ssbuf.sb.st_size - src->position, 0)) + step;
} else {
buflen = step;
}

result = (char*) emalloc(buflen);
ptr = result;

while ((ret = php_stream_read(src, ptr, buflen - *len)) > 0) {
*len += ret;
if (*len + min_room >= buflen) {
buflen += step;
result = (char*) erealloc(result, buflen);
ptr = result + *len;
} else {
ptr += ret;
}
}

if (*len == 0) {
efree(result);
result = NULL;
} else if (UNEXPECTED(*len + simdjson::SIMDJSON_PADDING > buflen)) {
result = (char*) erealloc(result, ZEND_MM_ALIGNED_SIZE(*len + simdjson::SIMDJSON_PADDING));
}

return result;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add safety checks for stream size and read timeout.

While the implementation is efficient, consider adding the following safety measures:

  1. Maximum size limit to prevent memory exhaustion attacks
  2. Timeout mechanism for slow or hanging streams
  3. Consider using php_stream_copy_to_mem_ex with size limit instead of custom implementation

Example implementation with size limit:

 static char *simdjson_stream_copy_to_mem(php_stream *src, size_t *len) {
+    const size_t MAX_SIZE = 100 * 1024 * 1024; // 100MB limit
     ssize_t ret = 0;
     char *ptr;
     size_t buflen;
     int step = 8192;
     int min_room = 8192 / 4;
     php_stream_statbuf ssbuf;
     char* result;
 
     if (php_stream_stat(src, &ssbuf) == 0 && ssbuf.sb.st_size > 0) {
+        if (ssbuf.sb.st_size > MAX_SIZE) {
+            php_error_docref(NULL, E_WARNING, "Stream size exceeds maximum allowed size");
+            return NULL;
+        }
         buflen = ZEND_MM_ALIGNED_SIZE(MAX(ssbuf.sb.st_size - src->position, 0)) + step;
     } else {
         buflen = step;
     }
 
     result = (char*) emalloc(buflen);
     ptr = result;
 
     while ((ret = php_stream_read(src, ptr, buflen - *len)) > 0) {
         *len += ret;
+        if (*len > MAX_SIZE) {
+            efree(result);
+            php_error_docref(NULL, E_WARNING, "Stream size exceeds maximum allowed size");
+            return NULL;
+        }
         if (*len + min_room >= buflen) {
             buflen += step;
             result = (char*) erealloc(result, buflen);
             ptr = result + *len;
         } else {
             ptr += ret;
         }
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Simplified version of _php_stream_copy_to_mem that allways allocated string with required padding and returns
// char* instead of of zend_string* to avoid unnecessary overhead
static char *simdjson_stream_copy_to_mem(php_stream *src, size_t *len) {
ssize_t ret = 0;
char *ptr;
size_t buflen;
int step = 8192;
int min_room = 8192 / 4;
php_stream_statbuf ssbuf;
char* result;
/* avoid many reallocs by allocating a good-sized chunk to begin with, if
* we can. Note that the stream may be filtered, in which case the stat
* result may be inaccurate, as the filter may inflate or deflate the
* number of bytes that we can read. In order to avoid an upsize followed
* by a downsize of the buffer, overestimate by the step size (which is
* 8K). */
if (php_stream_stat(src, &ssbuf) == 0 && ssbuf.sb.st_size > 0) {
buflen = ZEND_MM_ALIGNED_SIZE(MAX(ssbuf.sb.st_size - src->position, 0)) + step;
} else {
buflen = step;
}
result = (char*) emalloc(buflen);
ptr = result;
while ((ret = php_stream_read(src, ptr, buflen - *len)) > 0) {
*len += ret;
if (*len + min_room >= buflen) {
buflen += step;
result = (char*) erealloc(result, buflen);
ptr = result + *len;
} else {
ptr += ret;
}
}
if (*len == 0) {
efree(result);
result = NULL;
} else if (UNEXPECTED(*len + simdjson::SIMDJSON_PADDING > buflen)) {
result = (char*) erealloc(result, ZEND_MM_ALIGNED_SIZE(*len + simdjson::SIMDJSON_PADDING));
}
return result;
}
static char *simdjson_stream_copy_to_mem(php_stream *src, size_t *len) {
const size_t MAX_SIZE = 100 * 1024 * 1024; // 100MB limit
ssize_t ret = 0;
char *ptr;
size_t buflen;
int step = 8192;
int min_room = 8192 / 4;
php_stream_statbuf ssbuf;
char* result;
/* avoid many reallocs by allocating a good-sized chunk to begin with, if
* we can. Note that the stream may be filtered, in which case the stat
* result may be inaccurate, as the filter may inflate or deflate the
* number of bytes that we can read. In order to avoid an upsize followed
* by a downsize of the buffer, overestimate by the step size (which is
* 8K). */
if (php_stream_stat(src, &ssbuf) == 0 && ssbuf.sb.st_size > 0) {
if (ssbuf.sb.st_size > MAX_SIZE) {
php_error_docref(NULL, E_WARNING, "Stream size exceeds maximum allowed size");
return NULL;
}
buflen = ZEND_MM_ALIGNED_SIZE(MAX(ssbuf.sb.st_size - src->position, 0)) + step;
} else {
buflen = step;
}
result = (char*) emalloc(buflen);
ptr = result;
while ((ret = php_stream_read(src, ptr, buflen - *len)) > 0) {
*len += ret;
if (*len > MAX_SIZE) {
efree(result);
php_error_docref(NULL, E_WARNING, "Stream size exceeds maximum allowed size");
return NULL;
}
if (*len + min_room >= buflen) {
buflen += step;
result = (char*) erealloc(result, buflen);
ptr = result + *len;
} else {
ptr += ret;
}
}
if (*len == 0) {
efree(result);
result = NULL;
} else if (UNEXPECTED(*len + simdjson::SIMDJSON_PADDING > buflen)) {
result = (char*) erealloc(result, ZEND_MM_ALIGNED_SIZE(*len + simdjson::SIMDJSON_PADDING));
}
return result;
}

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
tests/compat/json_decode_invalid_utf8.phpt (1)

20-24: LGTM! Consider adding more edge cases.

The exception type update is consistent with the broader changes. The test covers basic invalid UTF-8 scenarios.

Consider adding test cases for:

  • UTF-8 sequences longer than 4 bytes
  • Incomplete UTF-8 sequences at string boundaries
  • UTF-16 surrogate pairs
tests/decode_max_depth_memory_reduction.phpt (1)

30-30: Consider adding memory usage verification.

While the test verifies functionality, it would be valuable to confirm the memory optimization claims.

Add memory usage tracking:

 foreach ([1024, 1 << 27] as $depth) {
+    $memory_before = memory_get_usage(true);
     echo "Test depth=$depth:\n";
     $value = simdjson_decode('[]', true, $depth);
+    $memory_after = memory_get_usage(true);
+    echo "Memory used: " . ($memory_after - $memory_before) . " bytes\n";
     var_dump($value);

Also applies to: 35-35

tests/compat/bug69187.phpt (1)

10-10: Document error codes and add constant checks.

The test covers error cases well, but could benefit from better documentation and validation.

Add error code validation and documentation:

+// Error code mapping for compatibility testing
+$error_map = [
+    SIMDJSON_ERR_NO_JSON_FOUND => JSON_ERROR_SYNTAX,
+    SIMDJSON_ERR_INVALID_UTF8 => JSON_ERROR_UTF8,
+    SIMDJSON_ERR_UNESCAPED_CHARS => JSON_ERROR_CTRL_CHAR,
+];
+
 function compat_decode($value) {
     global $lasterr;
     $lasterr = 0;
     try {
         return simdjson_decode($value);
     } catch (SimdJsonDecoderException $e) {
+        // Verify error code mapping
+        assert(isset($error_map[$e->getCode()]), "Unmapped error code: " . $e->getCode());
         return sprintf("%s: %d: %s", get_class($e), $e->getCode(), $e->getMessage());
     }
 }

Also applies to: 31-32, 36-40

src/simdjson_encoder.h (1)

22-22: LGTM! Consider adding documentation.

The renaming of simdjson_error_code to simdjson_encoder_error_code improves code organization by clearly separating encoder-specific error codes. Consider adding documentation to describe the purpose and usage of each error code.

Also applies to: 36-36

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 613a38f and 923a6de.

📒 Files selected for processing (16)
  • php_simdjson.cpp (9 hunks)
  • php_simdjson.h (2 hunks)
  • simdjson.stub.php (2 hunks)
  • simdjson_arginfo.h (6 hunks)
  • src/simdjson_decoder.cpp (2 hunks)
  • src/simdjson_encoder.h (2 hunks)
  • tests/compat/bug62010.phpt (1 hunks)
  • tests/compat/bug64874_part2.phpt (2 hunks)
  • tests/compat/bug68546.phpt (1 hunks)
  • tests/compat/bug68938.phpt (1 hunks)
  • tests/compat/bug69187.phpt (2 hunks)
  • tests/compat/fail001.phpt (2 hunks)
  • tests/compat/json_decode_invalid_utf8.phpt (1 hunks)
  • tests/decode_max_depth_memory_reduction.phpt (1 hunks)
  • tests/decode_stream.phpt (1 hunks)
  • tests/key_value_slash.phpt (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/decode_stream.phpt
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Ubuntu-dev
  • GitHub Check: Macos (8.3)
  • GitHub Check: Ubuntu-dev
🔇 Additional comments (15)
simdjson.stub.php (1)

349-353: LGTM!

The new exception classes SimdJsonDecoderException and SimdJsonEncoderException are properly defined and extend SimdJsonException, enhancing error handling for decoding and encoding processes.

tests/compat/bug68938.phpt (1)

7-7: LGTM!

Updating the catch block to use SimdJsonDecoderException and adjusting the expected output aligns with the new exception hierarchy, ensuring that decoding errors are handled appropriately.

Also applies to: 12-12

tests/compat/bug62010.phpt (1)

13-13: LGTM!

Changing the expected exception to SimdJsonDecoderException correctly reflects the updated exception handling for parsing errors during decoding.

tests/compat/bug68546.phpt (1)

26-27: LGTM! Consider validating error codes.

The exception type update is consistent. The test properly verifies null byte handling in both objects and arrays.

Consider adding assertions to validate that the error code (255) is the correct constant from the extension:

assert(255 === SIMDJSON_ERR_INVALID_PROPERTY_NAME, "Error code mismatch");

Also applies to: 30-30

tests/compat/bug64874_part2.phpt (1)

8-8: LGTM! Exception type updated correctly.

The change from SimdJsonException to SimdJsonDecoderException aligns with the broader refactoring of exception handling, maintaining test functionality while improving error specificity.

tests/key_value_slash.phpt (1)

76-77: LGTM! Exception types consistently updated.

The changes systematically update all catch blocks from SimdJsonException to SimdJsonDecoderException, maintaining consistency with the new exception handling structure while preserving test functionality.

Also applies to: 79-80, 83-85, 88-90, 93-95, 98-100, 103-105, 108-110, 113-115, 118-120, 123-125, 128-130, 133-135, 138-140, 143-145, 148-150, 153-155, 158-160, 163-165, 168-170

tests/compat/fail001.phpt (1)

38-38: LGTM! Exception types consistently updated.

The changes systematically update all catch blocks from SimdJsonException to SimdJsonDecoderException, maintaining consistency with the new exception handling structure while preserving comprehensive test coverage for various JSON parsing failure scenarios.

Also applies to: 44-44, 58-60, 63-65, 68-70, 73-75, 78-80, 83-85, 88-90, 93-95, 98-100, 103-105, 108-110, 113-115, 118-120, 123-125, 128-130, 133-135, 138-140, 143-145, 148-150, 153-155, 158-160, 163-165, 168-170

php_simdjson.h (2)

111-112: LGTM: Exception class split improves error handling granularity.

The split into separate decoder and encoder exceptions follows the separation of concerns principle and allows for more specific error handling.


171-177: 🛠️ Refactor suggestion

Document buffer padding and memory safety requirements.

The function documentation should include:

  1. The required padding size (simdjson::SIMDJSON_PADDING)
  2. Memory safety considerations
  3. Parameter descriptions

Apply this diff to improve the documentation:

 /**
- * Parses the given string into a return code. You have to be sure that the provided JSON contains simdjson::SIMDJSON_PADDING
+ * Parses a JSON string buffer into a PHP value.
  *
+ * @param simdjson_php_parser* parser The parser instance to use
+ * @param const char *json The JSON string buffer with SIMDJSON_PADDING extra bytes allocated
+ * @param size_t len The length of the JSON string (excluding padding)
+ * @param zval *return_value The zval to store the parsed result
+ * @param bool associative Whether to parse objects as associative arrays
+ * @param size_t depth Maximum nesting depth
+ * @warning The json parameter must have SIMDJSON_PADDING extra bytes allocated after len
+ * @warning The caller must ensure memory safety and proper buffer size
+ *
  * If the returned error code is 0, then return_value contains the parsed value.
  * If the returned error code is non-0, then return_value will not be initialized.
  */

Likely invalid or redundant comment.

simdjson_arginfo.h (2)

17-21: LGTM: Well-defined argument info for stream decoding.

The argument info is properly defined with:

  • Correct return type (IS_MIXED)
  • Resource parameter
  • Consistent default values for optional parameters

109-116: LGTM: Clean exception class hierarchy.

The exception classes are properly defined with:

  • Empty method entries (inheriting parent functionality)
  • Correct inheritance from base exception
php_simdjson.cpp (3)

228-275: LGTM: Well-implemented stream decoding function.

The function:

  • Properly validates parameters
  • Handles memory management correctly
  • Uses appropriate error handling

531-531: LGTM: Proper use of specialized exceptions.

The code correctly uses the new specialized exception classes:

  • simdjson_encoder_exception_ce for encoding errors
  • Consistent error message handling

Also applies to: 582-582


176-226: ⚠️ Potential issue

Add safety checks for stream operations.

The stream copy function needs additional safety measures:

  1. Maximum size limit to prevent memory exhaustion
  2. Timeout mechanism for slow streams
  3. Consider using php_stream_copy_to_mem_ex with size limit

Apply this diff to add safety measures:

 static char *simdjson_stream_copy_to_mem(php_stream *src, size_t *len) {
+    const size_t MAX_SIZE = 100 * 1024 * 1024; // 100MB limit
     ssize_t ret = 0;
     char *ptr;
     size_t buflen;
     int step = 8192;
     int min_room = 8192 / 4;
     php_stream_statbuf ssbuf;
     char* result;
 
     if (php_stream_stat(src, &ssbuf) == 0 && ssbuf.sb.st_size > 0) {
+        if (ssbuf.sb.st_size > MAX_SIZE) {
+            php_error_docref(NULL, E_WARNING, "Stream size exceeds maximum allowed size");
+            return NULL;
+        }
         buflen = ZEND_MM_ALIGNED_SIZE(MAX(ssbuf.sb.st_size - src->position, 0)) + step;
     } else {
         buflen = step;
     }
 
     result = (char*) emalloc(buflen);
     ptr = result;
 
     while ((ret = php_stream_read(src, ptr, buflen - *len)) > 0) {
         *len += ret;
+        if (*len > MAX_SIZE) {
+            efree(result);
+            php_error_docref(NULL, E_WARNING, "Stream size exceeds maximum allowed size");
+            return NULL;
+        }
         if (*len + min_room >= buflen) {
             buflen += step;
             result = (char*) erealloc(result, buflen);
             ptr = result + *len;
         } else {
             ptr += ret;
         }
     }

Likely invalid or redundant comment.

src/simdjson_decoder.cpp (1)

553-558: LGTM: Well-implemented buffer parsing function.

The function:

  • Reuses existing parsing logic
  • Properly handles memory management
  • Maintains consistency with other parsing functions

Comment on lines +253 to +267
/**
* Takes a JSON encoded string and converts it into a PHP variable.
* Similar to json_decode()
*
* @param resource $res A file system pointer resource that is typically created using fopen().
* @param bool $associative When true, JSON objects will be returned as associative arrays.
* When false, JSON objects will be returned as objects.
* @param int $depth the maximum nesting depth of the structure being decoded.
* @return array|stdClass|string|float|int|bool|null
* @throws SimdJsonException for invalid JSON
* (or $json over 4GB long, or out of range integer/float)
* @throws ValueError for invalid $depth
*/
function simdjson_decode_from_stream($res, bool $associative = false, int $depth = 512): mixed {}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Update the function description and @throws annotations to reflect stream input and correct exception

The description for simdjson_decode_from_stream mentions "Takes a JSON encoded string," which is inconsistent with the $res parameter that expects a file system pointer resource. Please update the description to accurately reflect that this function reads and decodes JSON data from a stream.

Additionally, the @throws annotation currently specifies SimdJsonException, but with the introduction of SimdJsonDecoderException, it would be more precise to update it to @throws SimdJsonDecoderException to reflect the specific exception thrown by this function.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
simdjson_arginfo.h (2)

17-21: Consider adding type validation for the resource parameter.

The res parameter lacks type validation in the argument info. Consider using ZEND_ARG_TYPE_INFO with IS_RESOURCE type to ensure proper type checking at the PHP level.

-	ZEND_ARG_INFO(0, res)
+	ZEND_ARG_TYPE_INFO(0, res, IS_RESOURCE, 0)

159-180: Well-structured exception hierarchy!

The new exception hierarchy with specific decoder and encoder exceptions improves error handling granularity. This follows the best practice of having specialized exceptions for different error categories.

Consider documenting the specific use cases for each exception type in the PHP documentation to help users handle errors appropriately.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 95006e2 and 5ced3dc.

📒 Files selected for processing (18)
  • .github/workflows/integration.yml (2 hunks)
  • docker_php8.4-dev (1 hunks)
  • php_simdjson.cpp (9 hunks)
  • php_simdjson.h (2 hunks)
  • simdjson.stub.php (5 hunks)
  • simdjson_arginfo.h (7 hunks)
  • src/simdjson_decoder.cpp (2 hunks)
  • src/simdjson_encoder.h (2 hunks)
  • tests/compat/bug62010.phpt (1 hunks)
  • tests/compat/bug64874_part2.phpt (2 hunks)
  • tests/compat/bug68546.phpt (1 hunks)
  • tests/compat/bug68938.phpt (1 hunks)
  • tests/compat/bug69187.phpt (2 hunks)
  • tests/compat/fail001.phpt (2 hunks)
  • tests/compat/json_decode_invalid_utf8.phpt (1 hunks)
  • tests/decode_max_depth_memory_reduction.phpt (1 hunks)
  • tests/decode_stream.phpt (1 hunks)
  • tests/key_value_slash.phpt (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (13)
  • .github/workflows/integration.yml
  • tests/decode_stream.phpt
  • tests/compat/bug69187.phpt
  • tests/compat/bug68546.phpt
  • tests/compat/json_decode_invalid_utf8.phpt
  • tests/compat/fail001.phpt
  • tests/compat/bug64874_part2.phpt
  • docker_php8.4-dev
  • tests/compat/bug62010.phpt
  • tests/key_value_slash.phpt
  • tests/compat/bug68938.phpt
  • tests/decode_max_depth_memory_reduction.phpt
  • src/simdjson_encoder.h
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: Ubuntu-dev
  • GitHub Check: Macos (8.4)
  • GitHub Check: Macos (8.3)
  • GitHub Check: Macos (8.2)
  • GitHub Check: Macos (8.1)
  • GitHub Check: Macos (8.0)
  • GitHub Check: Ubuntu-dev
🔇 Additional comments (14)
php_simdjson.h (2)

111-112: LGTM! Good separation of concerns.

The split into decoder and encoder specific exceptions improves error handling granularity and follows the single responsibility principle.


170-177: 🛠️ Refactor suggestion

Document the buffer padding requirement.

The function requires the input buffer to have simdjson::SIMDJSON_PADDING extra bytes, but this requirement isn't documented in the parameter list. This could lead to buffer overflows if callers aren't aware of the padding requirement.

Apply this diff to improve the documentation:

 /**
  * Parses the given string into a return code. You have to be sure that the provided JSON contains simdjson::SIMDJSON_PADDING
  *
+ * @param simdjson_php_parser* parser The parser instance to use
+ * @param const char *json The JSON string buffer with SIMDJSON_PADDING extra bytes allocated
+ * @param size_t len The length of the JSON string (excluding padding)
+ * @param zval *return_value The zval to store the parsed result
+ * @param bool associative Whether to parse objects as associative arrays
+ * @param size_t depth Maximum nesting depth
+ * @warning The json parameter must have SIMDJSON_PADDING extra bytes allocated after len
  *
  * If the returned error code is 0, then return_value contains the parsed value.
  * If the returned error code is non-0, then return_value will not be initialized.
  */

Likely invalid or redundant comment.

simdjson.stub.php (4)

353-354: LGTM! Good use of abstract class.

Making SimdJsonException abstract enforces that only specific decoder or encoder exceptions can be instantiated, which is the correct design.


355-356: LGTM! Clean implementation of decoder exception.

The decoder-specific exception class is well-structured and properly extends the abstract base class.


357-357: LGTM! Clean implementation of encoder exception.

The encoder-specific exception class is well-structured and properly extends the abstract base class.


253-266: 🛠️ Refactor suggestion

Update function description to reflect stream input.

The description incorrectly states "Takes a JSON encoded string" but the function actually takes a file system pointer resource. Update the description to accurately reflect the stream input.

Apply this diff to correct the documentation:

 /**
- * Takes a JSON encoded string and converts it into a PHP variable.
+ * Reads and decodes JSON data from a stream into a PHP variable.
  * Similar to json_decode()
  *
  * @param resource $res A file system pointer resource that is typically created using fopen().
  * @param bool $associative When true, JSON objects will be returned as associative arrays.
  *                          When false, JSON objects will be returned as objects.
  * @param int $depth the maximum nesting depth of the structure being decoded.
  * @return array|stdClass|string|float|int|bool|null
  * @throws SimdJsonDecoderException for invalid JSON
  *                           (or $json over 4GB long, or out of range integer/float)
  * @throws ValueError for invalid $depth
  */

Likely invalid or redundant comment.

php_simdjson.cpp (4)

230-277: LGTM! Well-structured implementation with proper error handling.

The implementation is thorough with:

  • Proper depth validation
  • Memory cleanup on errors
  • Consistent error handling pattern

533-534: LGTM! Consistent use of encoder-specific exception.

The changes properly use the new simdjson_encoder_exception_ce for encoding-related errors.

Also applies to: 584-585


634-637: LGTM! Proper exception class registration order.

The registration follows the correct order by registering the abstract base class first, followed by the specific exception classes.


176-228: 🛠️ Refactor suggestion

Add safety checks for stream size and read timeout.

While the implementation is efficient, consider adding the following safety measures:

  1. Maximum size limit to prevent memory exhaustion attacks
  2. Timeout mechanism for slow or hanging streams
  3. Consider using php_stream_copy_to_mem_ex with size limit instead of custom implementation

Example implementation with size limit:

 static char *simdjson_stream_copy_to_mem(php_stream *src, size_t *len) {
+    const size_t MAX_SIZE = 100 * 1024 * 1024; // 100MB limit
     ssize_t ret = 0;
     char *ptr;
     size_t buflen;
     int step = 8192;
     int min_room = 8192 / 4;
     php_stream_statbuf ssbuf;
     char* result;
 
     if (php_stream_stat(src, &ssbuf) == 0 && ssbuf.sb.st_size > 0) {
+        if (ssbuf.sb.st_size > MAX_SIZE) {
+            php_error_docref(NULL, E_WARNING, "Stream size exceeds maximum allowed size");
+            return NULL;
+        }
         buflen = ZEND_MM_ALIGNED_SIZE(MAX(ssbuf.sb.st_size - src->position, 0)) + step;
     } else {
         buflen = step;
     }
 
     result = (char*) emalloc(buflen);
     ptr = result;
 
     while ((ret = php_stream_read(src, ptr, buflen - *len)) > 0) {
         *len += ret;
+        if (*len > MAX_SIZE) {
+            efree(result);
+            php_error_docref(NULL, E_WARNING, "Stream size exceeds maximum allowed size");
+            return NULL;
+        }
         if (*len + min_room >= buflen) {
             buflen += step;
             result = (char*) erealloc(result, buflen);
             ptr = result + *len;
         } else {
             ptr += ret;
         }
     }

Likely invalid or redundant comment.

src/simdjson_decoder.cpp (2)

49-49: LGTM! Consistent use of decoder-specific exception.

The change properly uses the new simdjson_decoder_exception_ce for decoding-related errors.


553-558: LGTM! Efficient implementation reusing existing parsing logic.

The implementation:

  • Properly reuses the existing parsing logic
  • Correctly sets realloc_if_needed to false for pre-allocated buffers
  • Maintains consistency with other parsing functions
simdjson_arginfo.h (2)

75-75: Function registration looks good!

The new function is properly declared and registered in the extension functions array, following the established pattern.

Also applies to: 90-90


104-104: Verify __toString behavior matches JSON serialization requirements.

While reusing jsonSerialize for __toString is efficient, verify that the JSON-serialized output is always suitable for string representation. Consider whether a dedicated string conversion might be more appropriate.

Run this script to check the implementation:

Also applies to: 190-190

✅ Verification successful

Reusing jsonSerialize for __toString is correct and safe

The implementation is intentionally designed this way, as both methods should return the same base64 encoded string. Base64 encoding guarantees string-compatible output, making it safe to use jsonSerialize for string conversion.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the implementation of SimdJsonBase64Encode::jsonSerialize
# to verify it always returns a string-compatible value

ast-grep --pattern 'class SimdJsonBase64Encode {
  $$$
  function jsonSerialize() {
    $$$
  }
  $$$
}'

Length of output: 106


Script:

#!/bin/bash
# First find relevant files containing the class name
rg -l "SimdJsonBase64Encode" 

# Then look for the implementation with context
rg -A 5 "function\s+jsonSerialize" 

Length of output: 409


Script:

#!/bin/bash
# Look for the implementation in the cpp file
rg -A 10 "zim_SimdJsonBase64Encode_jsonSerialize" php_simdjson.cpp

Length of output: 66

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant