decoder: New method simdjson_decode_from_stream#30
Conversation
WalkthroughThis pull request enhances the simdjson PHP extension by introducing new functionality for JSON decoding from streams and modifying existing parsing mechanisms. The changes primarily focus on improving how JSON data is processed, with updates to function signatures, memory management, and stream handling. A new function Changes
Possibly related PRs
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (6)
🔇 Additional comments (4)
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
php_simdjson.h (1)
169-176: Declare newphp_simdjson_parse_bufferfunctionA new function
php_simdjson_parse_bufferhas been declared (lines 169-176) to parse JSON data from a raw buffer. Please consider:
Documentation Clarity
- Enhance the documentation comment to clearly explain the padding requirement and any assumptions about the input buffer (lines 171-174).
Consistent Error Codes
- Ensure that the function returns error codes consistent with other parsing functions for seamless integration.
Usage Guidelines
- Provide guidance on how and when to use this function, especially regarding the necessity of the
simdjson::SIMDJSON_PADDING.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
php_simdjson.cpp(4 hunks)php_simdjson.h(1 hunks)simdjson.stub.php(1 hunks)simdjson_arginfo.h(4 hunks)src/simdjson_decoder.cpp(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: Ubuntu-dev
- GitHub Check: Macos (8.4)
- GitHub Check: Macos (8.3)
- GitHub Check: Macos (8.2)
- GitHub Check: Macos (8.1)
- GitHub Check: Macos (8.0)
- GitHub Check: Ubuntu-dev
🔇 Additional comments (8)
php_simdjson.cpp (5)
113-117: Updatesimdjson_simple_decodefunction signatureThe function
simdjson_simple_decodenow acceptsconst char *jsonandsize_t leninstead ofzend_string *json(lines 113-117). This change improves performance by avoiding the overhead ofzend_string. Ensure that all calls to this function are updated accordingly to match the new signature.
130-133: Verifymemcmpusage for boolean valuesThe checks for the boolean values
"true"and"false"have been updated to usememcmp(lines 130-133). Verify that this change correctly handles these values, including proper length checks and case sensitivity, to prevent any unexpected behavior.
175-220: Implementsimdjson_stream_copy_to_memfunctionA new function
simdjson_stream_copy_to_memhas been added (lines 175-220) to read JSON data from a PHP stream into memory with proper padding. Consider the following:
Memory Allocation and Padding
- Ensure that memory allocations account for the required padding (
simdjson::SIMDJSON_PADDING) as per simdjson's requirements (lines 215-217).Error Handling
- Check that the function gracefully handles errors, such as read failures or empty streams, and frees allocated memory appropriately (lines 212-215).
Stream Position Handling
- Verify that the function correctly handles the stream's position, especially if the stream is non-blocking or if it reaches EOF.
222-269: Addsimdjson_decode_from_streamfunctionThe new PHP function
simdjson_decode_from_streamallows decoding JSON data directly from a stream resource (lines 222-269). Please review the following:
Parameter Validation
- Ensure that the provided resource is a valid stream before proceeding (lines 223-238). Consider adding checks to handle invalid resources.
Depth Validation
- The depth parameter validation uses argument position
3(line 240). Since depth is the third optional argument, confirm that this aligns with the function's parameter order.Memory Management
- After decoding, the allocated memory for
jsonis freed (line 263). Ensure that there are no code paths wherejsonmight not be freed, which could lead to memory leaks.Error Handling Consistency
- Verify that errors are handled consistently with other decode functions, including throwing exceptions when parsing fails (lines 265-268).
249-252: Optimize simple JSON decodingThe call to
simdjson_simple_decodefor simple JSON values (lines 249-252) enhances performance for small inputs. Ensure that this optimization correctly handles all simple JSON cases when reading from streams and does not introduce any discrepancies compared to string-based decoding.simdjson_arginfo.h (1)
17-22: Addsimdjson_decode_from_streamto function entriesThe
simdjson_decode_from_streamfunction has been added to the argument info and function entries (lines 17-22, 76-77, 92-94). Please verify the following:
Argument Definitions
- Confirm that the arguments are correctly defined, including types and default values (lines 18-21). The resource parameter
resshould be properly typed.Stub Hash Update
- The stub hash has changed (line 2). Ensure that this update is consistent with the changes in the stub file and that the hash is correctly generated.
Function Registration
- Verify that the function is registered appropriately in
ext_functions(lines 92-94) and that it will be available to PHP scripts.To ensure consistency, run the following script to check the function declarations and registrations:
Also applies to: 76-77, 92-94
✅ Verification successful
Function implementation and registration verified
The
simdjson_decode_from_streamfunction is correctly implemented with:
- Proper resource parameter handling and type checking
- Matching argument definitions and implementation
- Complete function registration in the extension
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify that 'simdjson_decode_from_stream' is correctly declared and registered. # Check for function declaration in header files rg '^PHP_FUNCTION\(simdjson_decode_from_stream\)' --type-add cppheader:*.h --type cppheader # Check for argument info definition rg '^ZEND_BEGIN_ARG_WITH_RETURN_TYPE_INFO_EX\(arginfo_simdjson_decode_from_stream' simdjson_arginfo.h # Confirm function entry in ext_functions rg 'ZEND_FE\(simdjson_decode_from_stream' simdjson_arginfo.hLength of output: 433
Script:
#!/bin/bash # Check for function implementation in cpp files rg '^PHP_FUNCTION\(simdjson_decode_from_stream\)' --type cpp # Look for resource type definitions and handling rg 'le_simdjson|zend_register_resource' --type cpp -A 2 # Check for stream handling patterns ast-grep --pattern 'PHP_FUNCTION(simdjson_decode_from_stream) { $$$ }'Length of output: 256
Script:
#!/bin/bash # Get the implementation details with context rg -A 20 '^PHP_FUNCTION\(simdjson_decode_from_stream\)' php_simdjson.cpp # Look for stream-related type definitions rg 'php_stream|zend_resource' php_simdjson.cpp # Check for any resource cleanup or destruction rg 'ZEND_(FETCH|VERIFY)_RESOURCE' php_simdjson.cppLength of output: 1225
simdjson.stub.php (1)
253-266: LGTM! Well-documented function signature.The new function
simdjson_decode_from_streamis properly defined with comprehensive documentation, including:
- Clear parameter descriptions
- Return type specification
- Exception handling documentation
src/simdjson_decoder.cpp (1)
553-558: LGTM! Efficient implementation with proper error handling.The implementation is well-structured and includes performance optimization by avoiding unnecessary reallocation with
realloc_if_needed=false. The function properly reuses existing parsing logic and error handling mechanisms.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
simdjson.stub.php (1)
253-266: LGTM! Consider adding examples to the documentation.The function signature and documentation are well-structured. To enhance usability, consider adding usage examples in the documentation, showing how to use the function with different types of streams (e.g., file streams, memory streams).
Example addition to the documentation:
* @throws ValueError for invalid $depth + * @example + * // Example with file stream + * $fp = fopen('data.json', 'r'); + * $data = simdjson_decode_from_stream($fp); + * fclose($fp);php_simdjson.h (1)
169-176: LGTM! Consider documenting the padding size requirement.The function declaration is well-structured and properly exposed via PHP_SIMDJSON_API. To improve clarity, consider documenting the specific padding size requirement.
Add padding size information to the documentation:
* Parses the given string into a return code. You have to be sure that the provided JSON contains simdjson::SIMDJSON_PADDING + * The required padding size is defined by simdjson::SIMDJSON_PADDING (currently 32 bytes).php_simdjson.cpp (1)
175-220: Consider adding overflow checks for buffer size calculations.The stream reading implementation is well-structured, but buffer size calculations should be checked for potential overflow.
- buflen = ZEND_MM_ALIGNED_SIZE(MAX(ssbuf.sb.st_size - src->position, 0)) + step; + size_t remaining = MAX(ssbuf.sb.st_size - src->position, 0); + if (UNEXPECTED(remaining > SIZE_MAX - step)) { + php_error_docref(NULL, E_WARNING, "Input is too large"); + return NULL; + } + buflen = ZEND_MM_ALIGNED_SIZE(remaining) + step;
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
php_simdjson.cpp(4 hunks)php_simdjson.h(1 hunks)simdjson.stub.php(1 hunks)simdjson_arginfo.h(4 hunks)src/simdjson_decoder.cpp(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: Ubuntu-dev
- GitHub Check: Macos (8.4)
- GitHub Check: Macos (8.0)
- GitHub Check: Macos (8.3)
- GitHub Check: Macos (8.2)
- GitHub Check: Ubuntu-dev
🔇 Additional comments (4)
php_simdjson.cpp (2)
Line range hint
113-139: LGTM! Efficient implementation of simple JSON value parsing.The changes to use direct buffer access and memcmp improve efficiency for simple JSON values.
222-269: LGTM! Well-structured stream decoding implementation.The implementation includes proper resource validation, error handling, and memory management. All error paths properly clean up allocated memory.
simdjson_arginfo.h (1)
17-21: LGTM! Proper argument info and function registration.The argument info structure and function registration are correctly implemented, matching the function signature in simdjson.stub.php.
Also applies to: 76-76, 92-92
src/simdjson_decoder.cpp (1)
553-558: LGTM! Efficient implementation of buffer-based parsing.The new function provides an optimized path for parsing JSON directly from buffers:
- Avoids zend_string overhead
- Prevents unnecessary reallocations
- Maintains consistent error handling with other parsing functions
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (3)
tests/decode_stream.phpt (1)
12-16: Enhance error handling test.The test should verify the specific error message and code thrown by the empty memory stream.
try { - simdjson_decode_from_stream(fopen("php://memory", "r")); + $handle = fopen("php://memory", "r"); + simdjson_decode_from_stream($handle); + fclose($handle); + assert(false, "Expected exception was not thrown"); } catch (Exception $e) { - echo $e->getMessage(); + assert($e instanceof SimdJsonException, "Expected SimdJsonException, got " . get_class($e)); + assert($e->getCode() === SIMDJSON_ERR_EMPTY, "Expected SIMDJSON_ERR_EMPTY error code"); + assert($e->getMessage() === "no JSON found", "Unexpected error message: " . $e->getMessage()); }src/simdjson_decoder.cpp (1)
553-558: LGTM! Consider adding documentation.The implementation is clean and follows the established patterns. It properly reuses existing parsing and conversion logic while handling errors consistently.
Consider adding a comment to document that the input buffer must contain
simdjson::SIMDJSON_PADDINGextra bytes of padding for the SIMD operations to work safely.php_simdjson.cpp (1)
222-269: LGTM with a note about error handling.The implementation is well-structured and follows the established patterns in the codebase. It properly handles resources, memory management, and parser reuse.
Consider adding more specific error messages for different stream-related failures (e.g., network errors, timeouts) instead of using the generic
simdjson::EMPTYerror.Example improvement for error handling:
if ((json = simdjson_stream_copy_to_mem(stream, &len)) == NULL) { - php_simdjson_throw_jsonexception(simdjson::EMPTY); + if (php_stream_eof(stream)) { + php_simdjson_throw_jsonexception(simdjson::EMPTY); + } else { + php_simdjson_throw_jsonexception(simdjson::IO_ERROR); + } RETURN_THROWS(); }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
php_simdjson.cpp(4 hunks)php_simdjson.h(1 hunks)simdjson.stub.php(1 hunks)simdjson_arginfo.h(4 hunks)src/simdjson_decoder.cpp(2 hunks)tests/decode_stream.phpt(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- simdjson.stub.php
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: Ubuntu-dev
- GitHub Check: Ubuntu-dev
🔇 Additional comments (2)
simdjson_arginfo.h (1)
17-21: LGTM!The function declaration and registration are well-structured and consistent with other decode functions in the file. The parameter types and return type are appropriate for JSON decoding from a stream.
Also applies to: 92-92
php_simdjson.cpp (1)
Line range hint
113-139: LGTM!The function has been optimized to work directly with character pointers and lengths, using efficient memory comparison operations. The logic remains correct while improving performance by avoiding unnecessary string operations.
| foreach ($filesToCheck as $file) { | ||
| echo "Processing file $file\n"; | ||
| simdjson_decode_from_stream(fopen($file, 'r')); | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Add verification of decoded JSON data.
The test only checks if decoding succeeds but doesn't verify the correctness of the decoded data. Consider adding assertions to validate the structure and content of the decoded JSON.
foreach ($filesToCheck as $file) {
echo "Processing file $file\n";
- simdjson_decode_from_stream(fopen($file, 'r'));
+ $handle = fopen($file, 'r');
+ $decoded = simdjson_decode_from_stream($handle);
+ fclose($handle);
+
+ // Verify decoded data matches expected structure
+ $expected = json_decode(file_get_contents($file), true);
+ assert($decoded == $expected, "Decoded data doesn't match expected for file: $file");
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| foreach ($filesToCheck as $file) { | |
| echo "Processing file $file\n"; | |
| simdjson_decode_from_stream(fopen($file, 'r')); | |
| } | |
| foreach ($filesToCheck as $file) { | |
| echo "Processing file $file\n"; | |
| $handle = fopen($file, 'r'); | |
| $decoded = simdjson_decode_from_stream($handle); | |
| fclose($handle); | |
| // Verify decoded data matches expected structure | |
| $expected = json_decode(file_get_contents($file), true); | |
| assert($decoded == $expected, "Decoded data doesn't match expected for file: $file"); | |
| } |
|
|
||
| /** | ||
| * Parses the given string into a return code. You have to be sure that the provided JSON contains simdjson::SIMDJSON_PADDING | ||
| * | ||
| * If the returned error code is 0, then return_value contains the parsed value. | ||
| * If the returned error code is non-0, then return_value will not be initialized. | ||
| */ | ||
| PHP_SIMDJSON_API simdjson_php_error_code php_simdjson_parse_buffer(simdjson_php_parser* parser, const char *json, size_t len, zval *return_value, bool associative, size_t depth); |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Enhance documentation for buffer padding requirement.
The function requires the input buffer to have simdjson::SIMDJSON_PADDING extra bytes, but this requirement isn't documented in the parameter list. This could lead to buffer overflows if callers aren't aware of the padding requirement.
/**
* Parses the given string into a return code. You have to be sure that the provided JSON contains simdjson::SIMDJSON_PADDING
*
+ * @param simdjson_php_parser* parser The parser instance to use
+ * @param const char *json The JSON string buffer with SIMDJSON_PADDING extra bytes allocated
+ * @param size_t len The length of the JSON string (excluding padding)
+ * @param zval *return_value The zval to store the parsed result
+ * @param bool associative Whether to parse objects as associative arrays
+ * @param size_t depth Maximum nesting depth
+ * @warning The json parameter must have SIMDJSON_PADDING extra bytes allocated after len
*
* If the returned error code is 0, then return_value contains the parsed value.
* If the returned error code is non-0, then return_value will not be initialized.
*/📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| /** | |
| * Parses the given string into a return code. You have to be sure that the provided JSON contains simdjson::SIMDJSON_PADDING | |
| * | |
| * If the returned error code is 0, then return_value contains the parsed value. | |
| * If the returned error code is non-0, then return_value will not be initialized. | |
| */ | |
| PHP_SIMDJSON_API simdjson_php_error_code php_simdjson_parse_buffer(simdjson_php_parser* parser, const char *json, size_t len, zval *return_value, bool associative, size_t depth); | |
| /** | |
| * Parses the given string into a return code. You have to be sure that the provided JSON contains simdjson::SIMDJSON_PADDING | |
| * | |
| * @param simdjson_php_parser* parser The parser instance to use | |
| * @param const char *json The JSON string buffer with SIMDJSON_PADDING extra bytes allocated | |
| * @param size_t len The length of the JSON string (excluding padding) | |
| * @param zval *return_value The zval to store the parsed result | |
| * @param bool associative Whether to parse objects as associative arrays | |
| * @param size_t depth Maximum nesting depth | |
| * @warning The json parameter must have SIMDJSON_PADDING extra bytes allocated after len | |
| * | |
| * If the returned error code is 0, then return_value contains the parsed value. | |
| * If the returned error code is non-0, then return_value will not be initialized. | |
| */ | |
| PHP_SIMDJSON_API simdjson_php_error_code php_simdjson_parse_buffer(simdjson_php_parser* parser, const char *json, size_t len, zval *return_value, bool associative, size_t depth); |
| // Simplified version of _php_stream_copy_to_mem that allways allocated string with required padding and returns | ||
| // char* instead of of zend_string* to avoid unnecessary overhead | ||
| static char *simdjson_stream_copy_to_mem(php_stream *src, size_t *len) { | ||
| ssize_t ret = 0; | ||
| char *ptr; | ||
| size_t buflen; | ||
| int step = 8192; | ||
| int min_room = 8192 / 4; | ||
| php_stream_statbuf ssbuf; | ||
| char* result; | ||
|
|
||
| /* avoid many reallocs by allocating a good-sized chunk to begin with, if | ||
| * we can. Note that the stream may be filtered, in which case the stat | ||
| * result may be inaccurate, as the filter may inflate or deflate the | ||
| * number of bytes that we can read. In order to avoid an upsize followed | ||
| * by a downsize of the buffer, overestimate by the step size (which is | ||
| * 8K). */ | ||
| if (php_stream_stat(src, &ssbuf) == 0 && ssbuf.sb.st_size > 0) { | ||
| buflen = ZEND_MM_ALIGNED_SIZE(MAX(ssbuf.sb.st_size - src->position, 0)) + step; | ||
| } else { | ||
| buflen = step; | ||
| } | ||
|
|
||
| result = (char*) emalloc(buflen); | ||
| ptr = result; | ||
|
|
||
| while ((ret = php_stream_read(src, ptr, buflen - *len)) > 0) { | ||
| *len += ret; | ||
| if (*len + min_room >= buflen) { | ||
| buflen += step; | ||
| result = (char*) erealloc(result, buflen); | ||
| ptr = result + *len; | ||
| } else { | ||
| ptr += ret; | ||
| } | ||
| } | ||
|
|
||
| if (*len == 0) { | ||
| efree(result); | ||
| result = NULL; | ||
| } else if (UNEXPECTED(*len + simdjson::SIMDJSON_PADDING > buflen)) { | ||
| result = (char*) erealloc(result, ZEND_MM_ALIGNED_SIZE(*len + simdjson::SIMDJSON_PADDING)); | ||
| } | ||
|
|
||
| return result; | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Add safety checks for stream size and read timeout.
While the implementation is efficient, consider adding the following safety measures:
- Maximum size limit to prevent memory exhaustion attacks
- Timeout mechanism for slow or hanging streams
- Consider using
php_stream_copy_to_mem_exwith size limit instead of custom implementation
Example implementation with size limit:
static char *simdjson_stream_copy_to_mem(php_stream *src, size_t *len) {
+ const size_t MAX_SIZE = 100 * 1024 * 1024; // 100MB limit
ssize_t ret = 0;
char *ptr;
size_t buflen;
int step = 8192;
int min_room = 8192 / 4;
php_stream_statbuf ssbuf;
char* result;
if (php_stream_stat(src, &ssbuf) == 0 && ssbuf.sb.st_size > 0) {
+ if (ssbuf.sb.st_size > MAX_SIZE) {
+ php_error_docref(NULL, E_WARNING, "Stream size exceeds maximum allowed size");
+ return NULL;
+ }
buflen = ZEND_MM_ALIGNED_SIZE(MAX(ssbuf.sb.st_size - src->position, 0)) + step;
} else {
buflen = step;
}
result = (char*) emalloc(buflen);
ptr = result;
while ((ret = php_stream_read(src, ptr, buflen - *len)) > 0) {
*len += ret;
+ if (*len > MAX_SIZE) {
+ efree(result);
+ php_error_docref(NULL, E_WARNING, "Stream size exceeds maximum allowed size");
+ return NULL;
+ }
if (*len + min_room >= buflen) {
buflen += step;
result = (char*) erealloc(result, buflen);
ptr = result + *len;
} else {
ptr += ret;
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Simplified version of _php_stream_copy_to_mem that allways allocated string with required padding and returns | |
| // char* instead of of zend_string* to avoid unnecessary overhead | |
| static char *simdjson_stream_copy_to_mem(php_stream *src, size_t *len) { | |
| ssize_t ret = 0; | |
| char *ptr; | |
| size_t buflen; | |
| int step = 8192; | |
| int min_room = 8192 / 4; | |
| php_stream_statbuf ssbuf; | |
| char* result; | |
| /* avoid many reallocs by allocating a good-sized chunk to begin with, if | |
| * we can. Note that the stream may be filtered, in which case the stat | |
| * result may be inaccurate, as the filter may inflate or deflate the | |
| * number of bytes that we can read. In order to avoid an upsize followed | |
| * by a downsize of the buffer, overestimate by the step size (which is | |
| * 8K). */ | |
| if (php_stream_stat(src, &ssbuf) == 0 && ssbuf.sb.st_size > 0) { | |
| buflen = ZEND_MM_ALIGNED_SIZE(MAX(ssbuf.sb.st_size - src->position, 0)) + step; | |
| } else { | |
| buflen = step; | |
| } | |
| result = (char*) emalloc(buflen); | |
| ptr = result; | |
| while ((ret = php_stream_read(src, ptr, buflen - *len)) > 0) { | |
| *len += ret; | |
| if (*len + min_room >= buflen) { | |
| buflen += step; | |
| result = (char*) erealloc(result, buflen); | |
| ptr = result + *len; | |
| } else { | |
| ptr += ret; | |
| } | |
| } | |
| if (*len == 0) { | |
| efree(result); | |
| result = NULL; | |
| } else if (UNEXPECTED(*len + simdjson::SIMDJSON_PADDING > buflen)) { | |
| result = (char*) erealloc(result, ZEND_MM_ALIGNED_SIZE(*len + simdjson::SIMDJSON_PADDING)); | |
| } | |
| return result; | |
| } | |
| static char *simdjson_stream_copy_to_mem(php_stream *src, size_t *len) { | |
| const size_t MAX_SIZE = 100 * 1024 * 1024; // 100MB limit | |
| ssize_t ret = 0; | |
| char *ptr; | |
| size_t buflen; | |
| int step = 8192; | |
| int min_room = 8192 / 4; | |
| php_stream_statbuf ssbuf; | |
| char* result; | |
| /* avoid many reallocs by allocating a good-sized chunk to begin with, if | |
| * we can. Note that the stream may be filtered, in which case the stat | |
| * result may be inaccurate, as the filter may inflate or deflate the | |
| * number of bytes that we can read. In order to avoid an upsize followed | |
| * by a downsize of the buffer, overestimate by the step size (which is | |
| * 8K). */ | |
| if (php_stream_stat(src, &ssbuf) == 0 && ssbuf.sb.st_size > 0) { | |
| if (ssbuf.sb.st_size > MAX_SIZE) { | |
| php_error_docref(NULL, E_WARNING, "Stream size exceeds maximum allowed size"); | |
| return NULL; | |
| } | |
| buflen = ZEND_MM_ALIGNED_SIZE(MAX(ssbuf.sb.st_size - src->position, 0)) + step; | |
| } else { | |
| buflen = step; | |
| } | |
| result = (char*) emalloc(buflen); | |
| ptr = result; | |
| while ((ret = php_stream_read(src, ptr, buflen - *len)) > 0) { | |
| *len += ret; | |
| if (*len > MAX_SIZE) { | |
| efree(result); | |
| php_error_docref(NULL, E_WARNING, "Stream size exceeds maximum allowed size"); | |
| return NULL; | |
| } | |
| if (*len + min_room >= buflen) { | |
| buflen += step; | |
| result = (char*) erealloc(result, buflen); | |
| ptr = result + *len; | |
| } else { | |
| ptr += ret; | |
| } | |
| } | |
| if (*len == 0) { | |
| efree(result); | |
| result = NULL; | |
| } else if (UNEXPECTED(*len + simdjson::SIMDJSON_PADDING > buflen)) { | |
| result = (char*) erealloc(result, ZEND_MM_ALIGNED_SIZE(*len + simdjson::SIMDJSON_PADDING)); | |
| } | |
| return result; | |
| } |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
tests/compat/json_decode_invalid_utf8.phpt (1)
20-24: LGTM! Consider adding more edge cases.The exception type update is consistent with the broader changes. The test covers basic invalid UTF-8 scenarios.
Consider adding test cases for:
- UTF-8 sequences longer than 4 bytes
- Incomplete UTF-8 sequences at string boundaries
- UTF-16 surrogate pairs
tests/decode_max_depth_memory_reduction.phpt (1)
30-30: Consider adding memory usage verification.While the test verifies functionality, it would be valuable to confirm the memory optimization claims.
Add memory usage tracking:
foreach ([1024, 1 << 27] as $depth) { + $memory_before = memory_get_usage(true); echo "Test depth=$depth:\n"; $value = simdjson_decode('[]', true, $depth); + $memory_after = memory_get_usage(true); + echo "Memory used: " . ($memory_after - $memory_before) . " bytes\n"; var_dump($value);Also applies to: 35-35
tests/compat/bug69187.phpt (1)
10-10: Document error codes and add constant checks.The test covers error cases well, but could benefit from better documentation and validation.
Add error code validation and documentation:
+// Error code mapping for compatibility testing +$error_map = [ + SIMDJSON_ERR_NO_JSON_FOUND => JSON_ERROR_SYNTAX, + SIMDJSON_ERR_INVALID_UTF8 => JSON_ERROR_UTF8, + SIMDJSON_ERR_UNESCAPED_CHARS => JSON_ERROR_CTRL_CHAR, +]; + function compat_decode($value) { global $lasterr; $lasterr = 0; try { return simdjson_decode($value); } catch (SimdJsonDecoderException $e) { + // Verify error code mapping + assert(isset($error_map[$e->getCode()]), "Unmapped error code: " . $e->getCode()); return sprintf("%s: %d: %s", get_class($e), $e->getCode(), $e->getMessage()); } }Also applies to: 31-32, 36-40
src/simdjson_encoder.h (1)
22-22: LGTM! Consider adding documentation.The renaming of
simdjson_error_codetosimdjson_encoder_error_codeimproves code organization by clearly separating encoder-specific error codes. Consider adding documentation to describe the purpose and usage of each error code.Also applies to: 36-36
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
php_simdjson.cpp(9 hunks)php_simdjson.h(2 hunks)simdjson.stub.php(2 hunks)simdjson_arginfo.h(6 hunks)src/simdjson_decoder.cpp(2 hunks)src/simdjson_encoder.h(2 hunks)tests/compat/bug62010.phpt(1 hunks)tests/compat/bug64874_part2.phpt(2 hunks)tests/compat/bug68546.phpt(1 hunks)tests/compat/bug68938.phpt(1 hunks)tests/compat/bug69187.phpt(2 hunks)tests/compat/fail001.phpt(2 hunks)tests/compat/json_decode_invalid_utf8.phpt(1 hunks)tests/decode_max_depth_memory_reduction.phpt(1 hunks)tests/decode_stream.phpt(1 hunks)tests/key_value_slash.phpt(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/decode_stream.phpt
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Ubuntu-dev
- GitHub Check: Macos (8.3)
- GitHub Check: Ubuntu-dev
🔇 Additional comments (15)
simdjson.stub.php (1)
349-353: LGTM!The new exception classes
SimdJsonDecoderExceptionandSimdJsonEncoderExceptionare properly defined and extendSimdJsonException, enhancing error handling for decoding and encoding processes.tests/compat/bug68938.phpt (1)
7-7: LGTM!Updating the catch block to use
SimdJsonDecoderExceptionand adjusting the expected output aligns with the new exception hierarchy, ensuring that decoding errors are handled appropriately.Also applies to: 12-12
tests/compat/bug62010.phpt (1)
13-13: LGTM!Changing the expected exception to
SimdJsonDecoderExceptioncorrectly reflects the updated exception handling for parsing errors during decoding.tests/compat/bug68546.phpt (1)
26-27: LGTM! Consider validating error codes.The exception type update is consistent. The test properly verifies null byte handling in both objects and arrays.
Consider adding assertions to validate that the error code (255) is the correct constant from the extension:
assert(255 === SIMDJSON_ERR_INVALID_PROPERTY_NAME, "Error code mismatch");Also applies to: 30-30
tests/compat/bug64874_part2.phpt (1)
8-8: LGTM! Exception type updated correctly.The change from
SimdJsonExceptiontoSimdJsonDecoderExceptionaligns with the broader refactoring of exception handling, maintaining test functionality while improving error specificity.tests/key_value_slash.phpt (1)
76-77: LGTM! Exception types consistently updated.The changes systematically update all catch blocks from
SimdJsonExceptiontoSimdJsonDecoderException, maintaining consistency with the new exception handling structure while preserving test functionality.Also applies to: 79-80, 83-85, 88-90, 93-95, 98-100, 103-105, 108-110, 113-115, 118-120, 123-125, 128-130, 133-135, 138-140, 143-145, 148-150, 153-155, 158-160, 163-165, 168-170
tests/compat/fail001.phpt (1)
38-38: LGTM! Exception types consistently updated.The changes systematically update all catch blocks from
SimdJsonExceptiontoSimdJsonDecoderException, maintaining consistency with the new exception handling structure while preserving comprehensive test coverage for various JSON parsing failure scenarios.Also applies to: 44-44, 58-60, 63-65, 68-70, 73-75, 78-80, 83-85, 88-90, 93-95, 98-100, 103-105, 108-110, 113-115, 118-120, 123-125, 128-130, 133-135, 138-140, 143-145, 148-150, 153-155, 158-160, 163-165, 168-170
php_simdjson.h (2)
111-112: LGTM: Exception class split improves error handling granularity.The split into separate decoder and encoder exceptions follows the separation of concerns principle and allows for more specific error handling.
171-177: 🛠️ Refactor suggestionDocument buffer padding and memory safety requirements.
The function documentation should include:
- The required padding size (
simdjson::SIMDJSON_PADDING)- Memory safety considerations
- Parameter descriptions
Apply this diff to improve the documentation:
/** - * Parses the given string into a return code. You have to be sure that the provided JSON contains simdjson::SIMDJSON_PADDING + * Parses a JSON string buffer into a PHP value. * + * @param simdjson_php_parser* parser The parser instance to use + * @param const char *json The JSON string buffer with SIMDJSON_PADDING extra bytes allocated + * @param size_t len The length of the JSON string (excluding padding) + * @param zval *return_value The zval to store the parsed result + * @param bool associative Whether to parse objects as associative arrays + * @param size_t depth Maximum nesting depth + * @warning The json parameter must have SIMDJSON_PADDING extra bytes allocated after len + * @warning The caller must ensure memory safety and proper buffer size + * * If the returned error code is 0, then return_value contains the parsed value. * If the returned error code is non-0, then return_value will not be initialized. */Likely invalid or redundant comment.
simdjson_arginfo.h (2)
17-21: LGTM: Well-defined argument info for stream decoding.The argument info is properly defined with:
- Correct return type (IS_MIXED)
- Resource parameter
- Consistent default values for optional parameters
109-116: LGTM: Clean exception class hierarchy.The exception classes are properly defined with:
- Empty method entries (inheriting parent functionality)
- Correct inheritance from base exception
php_simdjson.cpp (3)
228-275: LGTM: Well-implemented stream decoding function.The function:
- Properly validates parameters
- Handles memory management correctly
- Uses appropriate error handling
531-531: LGTM: Proper use of specialized exceptions.The code correctly uses the new specialized exception classes:
simdjson_encoder_exception_cefor encoding errors- Consistent error message handling
Also applies to: 582-582
176-226:⚠️ Potential issueAdd safety checks for stream operations.
The stream copy function needs additional safety measures:
- Maximum size limit to prevent memory exhaustion
- Timeout mechanism for slow streams
- Consider using
php_stream_copy_to_mem_exwith size limitApply this diff to add safety measures:
static char *simdjson_stream_copy_to_mem(php_stream *src, size_t *len) { + const size_t MAX_SIZE = 100 * 1024 * 1024; // 100MB limit ssize_t ret = 0; char *ptr; size_t buflen; int step = 8192; int min_room = 8192 / 4; php_stream_statbuf ssbuf; char* result; if (php_stream_stat(src, &ssbuf) == 0 && ssbuf.sb.st_size > 0) { + if (ssbuf.sb.st_size > MAX_SIZE) { + php_error_docref(NULL, E_WARNING, "Stream size exceeds maximum allowed size"); + return NULL; + } buflen = ZEND_MM_ALIGNED_SIZE(MAX(ssbuf.sb.st_size - src->position, 0)) + step; } else { buflen = step; } result = (char*) emalloc(buflen); ptr = result; while ((ret = php_stream_read(src, ptr, buflen - *len)) > 0) { *len += ret; + if (*len > MAX_SIZE) { + efree(result); + php_error_docref(NULL, E_WARNING, "Stream size exceeds maximum allowed size"); + return NULL; + } if (*len + min_room >= buflen) { buflen += step; result = (char*) erealloc(result, buflen); ptr = result + *len; } else { ptr += ret; } }Likely invalid or redundant comment.
src/simdjson_decoder.cpp (1)
553-558: LGTM: Well-implemented buffer parsing function.The function:
- Reuses existing parsing logic
- Properly handles memory management
- Maintains consistency with other parsing functions
| /** | ||
| * Takes a JSON encoded string and converts it into a PHP variable. | ||
| * Similar to json_decode() | ||
| * | ||
| * @param resource $res A file system pointer resource that is typically created using fopen(). | ||
| * @param bool $associative When true, JSON objects will be returned as associative arrays. | ||
| * When false, JSON objects will be returned as objects. | ||
| * @param int $depth the maximum nesting depth of the structure being decoded. | ||
| * @return array|stdClass|string|float|int|bool|null | ||
| * @throws SimdJsonException for invalid JSON | ||
| * (or $json over 4GB long, or out of range integer/float) | ||
| * @throws ValueError for invalid $depth | ||
| */ | ||
| function simdjson_decode_from_stream($res, bool $associative = false, int $depth = 512): mixed {} | ||
|
|
There was a problem hiding this comment.
🛠️ Refactor suggestion
Update the function description and @throws annotations to reflect stream input and correct exception
The description for simdjson_decode_from_stream mentions "Takes a JSON encoded string," which is inconsistent with the $res parameter that expects a file system pointer resource. Please update the description to accurately reflect that this function reads and decodes JSON data from a stream.
Additionally, the @throws annotation currently specifies SimdJsonException, but with the introduction of SimdJsonDecoderException, it would be more precise to update it to @throws SimdJsonDecoderException to reflect the specific exception thrown by this function.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
simdjson_arginfo.h (2)
17-21: Consider adding type validation for the resource parameter.The
resparameter lacks type validation in the argument info. Consider usingZEND_ARG_TYPE_INFOwithIS_RESOURCEtype to ensure proper type checking at the PHP level.- ZEND_ARG_INFO(0, res) + ZEND_ARG_TYPE_INFO(0, res, IS_RESOURCE, 0)
159-180: Well-structured exception hierarchy!The new exception hierarchy with specific decoder and encoder exceptions improves error handling granularity. This follows the best practice of having specialized exceptions for different error categories.
Consider documenting the specific use cases for each exception type in the PHP documentation to help users handle errors appropriately.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
.github/workflows/integration.yml(2 hunks)docker_php8.4-dev(1 hunks)php_simdjson.cpp(9 hunks)php_simdjson.h(2 hunks)simdjson.stub.php(5 hunks)simdjson_arginfo.h(7 hunks)src/simdjson_decoder.cpp(2 hunks)src/simdjson_encoder.h(2 hunks)tests/compat/bug62010.phpt(1 hunks)tests/compat/bug64874_part2.phpt(2 hunks)tests/compat/bug68546.phpt(1 hunks)tests/compat/bug68938.phpt(1 hunks)tests/compat/bug69187.phpt(2 hunks)tests/compat/fail001.phpt(2 hunks)tests/compat/json_decode_invalid_utf8.phpt(1 hunks)tests/decode_max_depth_memory_reduction.phpt(1 hunks)tests/decode_stream.phpt(1 hunks)tests/key_value_slash.phpt(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (13)
- .github/workflows/integration.yml
- tests/decode_stream.phpt
- tests/compat/bug69187.phpt
- tests/compat/bug68546.phpt
- tests/compat/json_decode_invalid_utf8.phpt
- tests/compat/fail001.phpt
- tests/compat/bug64874_part2.phpt
- docker_php8.4-dev
- tests/compat/bug62010.phpt
- tests/key_value_slash.phpt
- tests/compat/bug68938.phpt
- tests/decode_max_depth_memory_reduction.phpt
- src/simdjson_encoder.h
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: Ubuntu-dev
- GitHub Check: Macos (8.4)
- GitHub Check: Macos (8.3)
- GitHub Check: Macos (8.2)
- GitHub Check: Macos (8.1)
- GitHub Check: Macos (8.0)
- GitHub Check: Ubuntu-dev
🔇 Additional comments (14)
php_simdjson.h (2)
111-112: LGTM! Good separation of concerns.The split into decoder and encoder specific exceptions improves error handling granularity and follows the single responsibility principle.
170-177: 🛠️ Refactor suggestionDocument the buffer padding requirement.
The function requires the input buffer to have
simdjson::SIMDJSON_PADDINGextra bytes, but this requirement isn't documented in the parameter list. This could lead to buffer overflows if callers aren't aware of the padding requirement.Apply this diff to improve the documentation:
/** * Parses the given string into a return code. You have to be sure that the provided JSON contains simdjson::SIMDJSON_PADDING * + * @param simdjson_php_parser* parser The parser instance to use + * @param const char *json The JSON string buffer with SIMDJSON_PADDING extra bytes allocated + * @param size_t len The length of the JSON string (excluding padding) + * @param zval *return_value The zval to store the parsed result + * @param bool associative Whether to parse objects as associative arrays + * @param size_t depth Maximum nesting depth + * @warning The json parameter must have SIMDJSON_PADDING extra bytes allocated after len * * If the returned error code is 0, then return_value contains the parsed value. * If the returned error code is non-0, then return_value will not be initialized. */Likely invalid or redundant comment.
simdjson.stub.php (4)
353-354: LGTM! Good use of abstract class.Making
SimdJsonExceptionabstract enforces that only specific decoder or encoder exceptions can be instantiated, which is the correct design.
355-356: LGTM! Clean implementation of decoder exception.The decoder-specific exception class is well-structured and properly extends the abstract base class.
357-357: LGTM! Clean implementation of encoder exception.The encoder-specific exception class is well-structured and properly extends the abstract base class.
253-266: 🛠️ Refactor suggestionUpdate function description to reflect stream input.
The description incorrectly states "Takes a JSON encoded string" but the function actually takes a file system pointer resource. Update the description to accurately reflect the stream input.
Apply this diff to correct the documentation:
/** - * Takes a JSON encoded string and converts it into a PHP variable. + * Reads and decodes JSON data from a stream into a PHP variable. * Similar to json_decode() * * @param resource $res A file system pointer resource that is typically created using fopen(). * @param bool $associative When true, JSON objects will be returned as associative arrays. * When false, JSON objects will be returned as objects. * @param int $depth the maximum nesting depth of the structure being decoded. * @return array|stdClass|string|float|int|bool|null * @throws SimdJsonDecoderException for invalid JSON * (or $json over 4GB long, or out of range integer/float) * @throws ValueError for invalid $depth */Likely invalid or redundant comment.
php_simdjson.cpp (4)
230-277: LGTM! Well-structured implementation with proper error handling.The implementation is thorough with:
- Proper depth validation
- Memory cleanup on errors
- Consistent error handling pattern
533-534: LGTM! Consistent use of encoder-specific exception.The changes properly use the new
simdjson_encoder_exception_cefor encoding-related errors.Also applies to: 584-585
634-637: LGTM! Proper exception class registration order.The registration follows the correct order by registering the abstract base class first, followed by the specific exception classes.
176-228: 🛠️ Refactor suggestionAdd safety checks for stream size and read timeout.
While the implementation is efficient, consider adding the following safety measures:
- Maximum size limit to prevent memory exhaustion attacks
- Timeout mechanism for slow or hanging streams
- Consider using
php_stream_copy_to_mem_exwith size limit instead of custom implementationExample implementation with size limit:
static char *simdjson_stream_copy_to_mem(php_stream *src, size_t *len) { + const size_t MAX_SIZE = 100 * 1024 * 1024; // 100MB limit ssize_t ret = 0; char *ptr; size_t buflen; int step = 8192; int min_room = 8192 / 4; php_stream_statbuf ssbuf; char* result; if (php_stream_stat(src, &ssbuf) == 0 && ssbuf.sb.st_size > 0) { + if (ssbuf.sb.st_size > MAX_SIZE) { + php_error_docref(NULL, E_WARNING, "Stream size exceeds maximum allowed size"); + return NULL; + } buflen = ZEND_MM_ALIGNED_SIZE(MAX(ssbuf.sb.st_size - src->position, 0)) + step; } else { buflen = step; } result = (char*) emalloc(buflen); ptr = result; while ((ret = php_stream_read(src, ptr, buflen - *len)) > 0) { *len += ret; + if (*len > MAX_SIZE) { + efree(result); + php_error_docref(NULL, E_WARNING, "Stream size exceeds maximum allowed size"); + return NULL; + } if (*len + min_room >= buflen) { buflen += step; result = (char*) erealloc(result, buflen); ptr = result + *len; } else { ptr += ret; } }Likely invalid or redundant comment.
src/simdjson_decoder.cpp (2)
49-49: LGTM! Consistent use of decoder-specific exception.The change properly uses the new
simdjson_decoder_exception_cefor decoding-related errors.
553-558: LGTM! Efficient implementation reusing existing parsing logic.The implementation:
- Properly reuses the existing parsing logic
- Correctly sets
realloc_if_neededto false for pre-allocated buffers- Maintains consistency with other parsing functions
simdjson_arginfo.h (2)
75-75: Function registration looks good!The new function is properly declared and registered in the extension functions array, following the established pattern.
Also applies to: 90-90
104-104: Verify __toString behavior matches JSON serialization requirements.While reusing
jsonSerializefor__toStringis efficient, verify that the JSON-serialized output is always suitable for string representation. Consider whether a dedicated string conversion might be more appropriate.Run this script to check the implementation:
Also applies to: 190-190
✅ Verification successful
Reusing jsonSerialize for __toString is correct and safe
The implementation is intentionally designed this way, as both methods should return the same base64 encoded string. Base64 encoding guarantees string-compatible output, making it safe to use jsonSerialize for string conversion.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check the implementation of SimdJsonBase64Encode::jsonSerialize # to verify it always returns a string-compatible value ast-grep --pattern 'class SimdJsonBase64Encode { $$$ function jsonSerialize() { $$$ } $$$ }'Length of output: 106
Script:
#!/bin/bash # First find relevant files containing the class name rg -l "SimdJsonBase64Encode" # Then look for the implementation with context rg -A 5 "function\s+jsonSerialize"Length of output: 409
Script:
#!/bin/bash # Look for the implementation in the cpp file rg -A 10 "zim_SimdJsonBase64Encode_jsonSerialize" php_simdjson.cppLength of output: 66
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Performance Improvements
Compatibility
These updates provide more flexible JSON handling in the simdjson PHP extension, allowing developers to work with JSON data from various sources more efficiently.