Adding JSONL support to json extractor #705
Conversation
Walkthrough: The changes in this pull request enhance the JSON extraction functionality by introducing support for JSON Lines (JSONL) input files alongside standard JSON.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (6)
plugins/json-extractor/src/index.ts (1)
11-11: Document supported file extensions

Consider adding JSDoc comments to document the supported file extensions and their expected format requirements.
```diff
+/**
+ * Creates a JSON/JSONL extractor plugin
+ * @param options Plugin options
+ * @returns Extractor function
+ * @supports File extensions: .json (JSON), .jsonl/.jsonlines (JSON Lines)
+ */
 export const JSONExtractor = (options?: PluginOptions) => {
   return Extractor(/\.(jsonl?|jsonlines)$/i, 'json', parseBuffer, options)
 }
```

plugins/json-extractor/src/parser.ts (1)
Line range hint 19-23: Improve error messages and type handling.

While the validation is good, consider these improvements:
- More descriptive error messages
- Proper object construction instead of type casting
```diff
 const parsedData = JSON.parse(fileContents)
 if (typeof parsedData !== 'object' || parsedData === null) {
-  console.error('Invalid input: data must be an object.')
-  return {} as WorkbookCapture
+  console.error('Invalid input: expected a JSON object or array, received:', typeof parsedData)
+  return { Sheet1: { headers: [], data: [] } }
 }
```

utils/extractor/src/index.ts (1)
82-82: LGTM! Consider improving type safety.

The addition of `fileExt` to the parseBuffer options is correct and aligns with the PR objectives to support JSONL files. However, the `options: any` type in the parseBuffer function signature could be improved for better type safety.

Consider updating the parseBuffer function signature to use a strongly typed options interface:
```diff
-  parseBuffer: (buffer: Buffer, options: any) => WorkbookCapture | Promise<WorkbookCapture>,
+  interface ParserOptions {
+    fileExt?: string;
+    headerSelectionEnabled?: boolean;
+    [key: string]: any; // For other optional properties
+  }
+  parseBuffer: (buffer: Buffer, options: ParserOptions) => WorkbookCapture | Promise<WorkbookCapture>,
```

plugins/json-extractor/src/parser.spec.ts (3)
8-158: Consider improving test data maintainability

While the test data structure is comprehensive, consider these improvements:
- Extract repeated structures (like Father hierarchy) into helper functions
- Vary the coordinate values across records to catch potential data mixing issues
- Add comments explaining the purpose of deep nesting levels
Example refactor for better maintainability:
```typescript
const createFatherHierarchy = (level: number) => {
  const hierarchy: Record<string, { value: string }> = {};
  for (let i = 1; i <= level; i++) {
    const prefix = 'Father.' + '.Father.'.repeat(i - 1);
    hierarchy[`${prefix}First Name`] = { value: `Father_First_${i}` };
    hierarchy[`${prefix}Last Name`] = { value: `Father_Last_${i}` };
  }
  return hierarchy;
};

const createTestPerson = (firstName: string, lastName: string, email: string, address: any) => ({
  'First Name': { value: firstName },
  'Last Name': { value: lastName },
  'Email': { value: email },
  ...createAddressFields(address),
  ...createFatherHierarchy(5)
});
```
160-167: Enhance test coverage with error scenarios

The happy path is well tested, but consider adding test cases for:
- Malformed JSON
- Empty file
- Invalid nested structures
Example additional test cases:
```typescript
it('handles malformed json gracefully', () => {
  const buffer = Buffer.from('{invalid json}');
  expect(() => parseBuffer(buffer)).toThrow();
});

it('handles empty file gracefully', () => {
  const buffer = Buffer.from('');
  expect(() => parseBuffer(buffer)).toThrow();
});
```
Line range hint 1-178: Overall implementation looks good with room for improvement

The implementation successfully adds JSONL support while maintaining compatibility with existing JSON functionality. The test structure is clean and well-organized.
Consider these architectural improvements:
- Add a test utilities file for common test data creation
- Consider using test.each for testing multiple file formats
- Add performance tests for large JSONL files
Example test.each implementation:
```typescript
test.each([
  ['json', '../ref/test-basic.json', undefined],
  ['jsonl', '../ref/test-basic.jsonl', 'jsonl']
])('has a single sheet from %s input', (format, filePath, fileExt) => {
  const buffer = fs.readFileSync(path.join(__dirname, filePath));
  const singleSheetCapture = parseBuffer(buffer, fileExt ? { fileExt } : undefined);
  expect(singleSheetCapture).toEqual(expectedSingleSheetCapture);
});
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- plugins/json-extractor/ref/test-basic.jsonl (1 hunks)
- plugins/json-extractor/src/index.ts (1 hunks)
- plugins/json-extractor/src/parser.spec.ts (1 hunks)
- plugins/json-extractor/src/parser.ts (1 hunks)
- utils/extractor/src/index.ts (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- plugins/json-extractor/ref/test-basic.jsonl
🔇 Additional comments (4)
plugins/json-extractor/src/index.ts (1)
11-11: Verify JSONL parsing behavior
The regex pattern correctly matches .json, .jsonl, and .jsonlines extensions, but we should verify that the parser correctly differentiates between regular JSON and JSONL formats.
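As a quick sanity check (illustrative, not part of the PR's test suite), the extension regex from index.ts can be exercised directly:

```typescript
// The file matcher used by JSONExtractor, copied from index.ts.
const pattern = /\.(jsonl?|jsonlines)$/i

console.log(pattern.test('data.json'))      // true
console.log(pattern.test('data.jsonl'))     // true
console.log(pattern.test('DATA.JSONLINES')) // true (case-insensitive flag)
console.log(pattern.test('data.csv'))       // false
```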
✅ Verification successful
JSONL parsing implementation is correctly handled
The implementation properly differentiates between JSON and JSONL formats:
- For JSONL files, it converts newline-separated JSON objects into a valid JSON array by adding brackets and commas
- The parser correctly handles both formats by checking the file extension (`fileExt === 'jsonl' || fileExt === 'jsonlines'`)
- Test coverage exists with a dedicated JSONL test case using a real JSONL file
- The expected output is identical for both JSON and JSONL inputs, ensuring format consistency
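The bracket-and-comma conversion described above can be sketched as follows (an illustrative helper, not the plugin's exact code):

```typescript
// Join non-empty JSONL lines into a JSON array string that the
// existing JSON code path can parse.
function jsonlToJsonArray(fileContents: string): string {
  const lines = fileContents
    .split('\n')
    .filter((line) => line.trim() !== '') // drop blank lines
  return `[${lines.join(',')}]`
}

const jsonl = '{"name":"Ada"}\n{"name":"Grace"}\n'
const records = JSON.parse(jsonlToJsonArray(jsonl))
console.log(records.length) // 2
```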
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if parseBuffer handles different file formats appropriately
# Search for parseBuffer implementation and format handling
ast-grep --pattern 'function parseBuffer($_) {
$$$
}'
# Look for any test files that verify different formats
fd -e test.ts -e spec.ts -X rg -A 5 'jsonl|jsonlines'
Length of output: 875
Script:
#!/bin/bash
# Let's check the parseBuffer implementation and test file content
# Check parseBuffer implementation in parser.ts
ast-grep --pattern 'export function parseBuffer($_) {
$$$
}'
# Check the test file content
cat plugins/json-extractor/src/parser.spec.ts
# Check if there's a test reference file
cat plugins/json-extractor/ref/test-basic.jsonl
Length of output: 10912
Script:
#!/bin/bash
# Let's check the parser.ts file to see how it handles different formats
# Check the parser implementation
cat plugins/json-extractor/src/parser.ts
# Also check if there are any other test files that might contain JSONL tests
fd -e test.ts -e spec.ts -X rg -l 'fileExt'
Length of output: 3017
plugins/json-extractor/src/parser.ts (2)
3-6: LGTM! Function signature change is well-structured.
The addition of the optional options parameter with fileExt property is well-typed and follows TypeScript best practices.
Line range hint 3-23: Verify test coverage and integration.
Please ensure:
- Test coverage for JSONL parsing edge cases
- Integration tests with the extractor system
✅ Verification successful
Let me check the test coverage and implementation details before making the final decision.
JSONL parsing implementation is properly tested and integrated
The verification shows:
- JSONL parsing is tested with a comprehensive test file containing multiple records
- Test file includes complex nested JSON structures, validating proper handling
- Integration with the extractor system is confirmed through the test implementation using `parseBuffer` with the `fileExt` option
- The implementation correctly handles the conversion of JSONL to JSON array format
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check test coverage for JSONL handling
echo "Checking for JSONL test cases..."
rg -l "test.*jsonl" --type ts
# Check integration points
echo "Checking extractor integration..."
rg -l "parseBuffer.*fileExt" --type ts
# Look for existing JSONL files in tests
echo "Looking for JSONL test files..."
fd -e jsonl -e jsonlines
Length of output: 452
Script:
#!/bin/bash
# Check the test implementation for JSONL
echo "Checking JSONL test implementation..."
rg "test.*jsonl" -B 2 -A 10 plugins/json-extractor/src/parser.spec.ts
# Check the test file content
echo "Checking test file content..."
cat plugins/json-extractor/ref/test-basic.jsonl
# Check integration test implementation
echo "Checking integration test implementation..."
rg "parseBuffer.*fileExt" -B 2 -A 5 plugins/json-extractor/src/parser.spec.ts
Length of output: 2956
utils/extractor/src/index.ts (1)
Line range hint 9-17: Verify parseBuffer implementations for JSONL support.
Since the Extractor is a higher-order function that accepts parseBuffer implementations, we should ensure all implementations properly handle the new fileExt option.
Let's verify the parseBuffer implementations:
✅ Verification successful
Let me gather more information about the parseBuffer implementations to ensure we have a complete picture.
Let me do one final check to find all parseBuffer implementations across different plugins.
Based on the evidence gathered from all the shell script executions, I can now provide a final response:
JSONL support is properly implemented in the JSON extractor plugin
The parseBuffer implementation in the JSON extractor plugin correctly handles the fileExt option for both 'jsonl' and 'jsonlines' extensions, with proper transformation of JSONL format to JSON array. This is also verified by test coverage in parser.spec.ts.
- ✓ Implementation in `plugins/json-extractor/src/parser.ts` handles both `.jsonl` and `.jsonlines` extensions
- ✓ The Extractor's fileExt validation in `utils/extractor/src/index.ts` properly validates file extensions
- ✓ Test coverage exists in `plugins/json-extractor/src/parser.spec.ts`
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Find all parseBuffer implementations and verify JSONL support
# Find files that might contain parseBuffer implementations
echo "Searching for parseBuffer implementations..."
rg -l "parseBuffer.*=.*\(.*buffer.*,.*options" --type ts
# Check for JSONL file extension handling
echo "Checking for JSONL extension handling..."
rg -A 5 "fileExt.*['\"]jsonl" --type ts
rg -A 5 "fileExt.*['\"]jsonlines" --type ts
Length of output: 1761
Script:
#!/bin/bash
# Let's find all parseBuffer implementations and their handling of options
ast-grep --pattern 'const parseBuffer = ($_, $_) => {
$$$
}'
# Also check for any other file extension handling in parsers
rg -A 5 "fileExt.*['\"]" --type ts
Length of output: 1942
Script:
#!/bin/bash
# Find all files in plugins directory that might contain parseBuffer implementations
fd . plugins -t f -e ts -x rg -l "parseBuffer.*=.*function|parseBuffer.*=.*\(.*\).*=>"
# Check the implementation details of parseBuffer in these files
fd . plugins -t f -e ts -x rg -A 10 "parseBuffer.*=.*function|parseBuffer.*=.*\(.*\).*=>"
Length of output: 180
Force-pushed from df8b40a to 380383f.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
.changeset/tender-lizards-report.md (1)
1-6: Enhance the release notes with more details.

The version bumps are appropriate for this feature addition. However, consider expanding the release notes to be more informative for users by:
- Mentioning the specific file extensions supported (`.jsonl` and `.jsonlines`)
- Adding a brief description of what JSONL (JSON Lines) format is
- Including any relevant usage examples or documentation links
Here's a suggested enhancement:
```diff
 ---
 '@flatfile/plugin-json-extractor': minor
 '@flatfile/util-extractor': minor
 ---

-This release adds support for JSONL files in the JSON Extractor plugin.
+This release adds support for JSONL (JSON Lines) files in the JSON Extractor plugin.
+Files with extensions `.jsonl` and `.jsonlines` are now supported. JSONL format
+allows storing structured data as multiple JSON objects, one per line, making it
+ideal for streaming and large datasets.
```
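For illustration (these records are made up, not taken from this PR's fixtures), a JSON Lines file is simply one complete JSON object per line:

```jsonl
{"First Name": "Ada", "Last Name": "Lovelace"}
{"First Name": "Grace", "Last Name": "Hopper"}
```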
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
- `package-lock.json` is excluded by `!**/package-lock.json`, `!**/*.json`
📒 Files selected for processing (6)
- .changeset/tender-lizards-report.md (1 hunks)
- plugins/json-extractor/ref/test-basic.jsonl (1 hunks)
- plugins/json-extractor/src/index.ts (1 hunks)
- plugins/json-extractor/src/parser.spec.ts (1 hunks)
- plugins/json-extractor/src/parser.ts (1 hunks)
- utils/extractor/src/index.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- plugins/json-extractor/ref/test-basic.jsonl
- plugins/json-extractor/src/index.ts
- plugins/json-extractor/src/parser.spec.ts
- plugins/json-extractor/src/parser.ts
- utils/extractor/src/index.ts
Force-pushed from eb71412 to 5a8bee8.
Caution
Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
plugins/json-extractor/src/parser.ts (1)
3-6: LGTM! Consider adding JSDoc documentation.

The function signature change is well-structured and maintains backward compatibility. The readonly modifier on the options type provides good immutability guarantees.
Consider adding JSDoc documentation to describe the new parameter and supported file extensions:
```diff
+/**
+ * Parses a buffer containing JSON or JSONL data into a WorkbookCapture
+ * @param buffer - The input buffer containing the file contents
+ * @param options - Optional parsing configuration
+ * @param options.fileExt - File extension to determine parsing strategy ('jsonl'|'jsonlines')
+ * @returns A WorkbookCapture containing the parsed data
+ */
 export function parseBuffer(
   buffer: Buffer,
   options?: { readonly fileExt?: string }
 ): WorkbookCapture {
```
🛑 Comments failed to post (1)
plugins/json-extractor/src/parser.ts (1)
15-30: ⚠️ Potential issue: Consider memory and security implications.
While the JSONL parsing implementation is functional, there are several concerns to address:
- Memory Usage: Loading the entire file into memory could be problematic for large files
- Security: No limit on number of lines could lead to DOS attacks
- Performance: Multiple array iterations could be optimized
Consider implementing streaming for large files and adding safety limits:
```diff
 if (options?.fileExt === 'jsonl' || options?.fileExt === 'jsonlines') {
+  const MAX_LINES = 1000000; // Configure based on your requirements
+  let lineCount = 0;
   const lines = fileContents
     .split('\n')
     .filter((line) => line.trim() !== '')
     .map((line) => {
+      if (++lineCount > MAX_LINES) {
+        throw new Error(`File exceeds maximum limit of ${MAX_LINES} lines`);
+      }
       try {
         JSON.parse(line)
         return line
       } catch (e) {
-        console.error('Invalid JSON line:', line)
+        console.error(`Invalid JSON at line ${lineCount}:`, e.message)
         return null
       }
     })
     .filter((line) => line !== null)
   fileContents = `[${lines.join(',')}]`
 }
```

For better scalability, consider implementing a streaming parser for large files:
- Use Node.js readline interface or a streaming JSON parser
- Process lines incrementally instead of loading entire file
- Implement backpressure handling
Would you like me to provide an example implementation using streams?
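The streaming approach suggested above could be sketched roughly as follows, assuming Node's `readline` and `stream` modules; the helper name and line limit are illustrative, not part of the plugin:

```typescript
import { createInterface } from 'node:readline'
import { Readable } from 'node:stream'

// Parse JSON Lines incrementally instead of buffering the whole file.
async function collectJsonl(
  input: Readable,
  maxLines = 1_000_000
): Promise<unknown[]> {
  const rl = createInterface({ input, crlfDelay: Infinity })
  const rows: unknown[] = []
  let count = 0
  for await (const line of rl) {
    if (line.trim() === '') continue // skip blank lines, as the parser does
    if (++count > maxLines) throw new Error(`Exceeded ${maxLines} lines`)
    rows.push(JSON.parse(line)) // throws on a malformed line
  }
  return rows
}

// Example: stream two records without touching the filesystem.
collectJsonl(Readable.from('{"a":1}\n{"a":2}\n')).then((rows) =>
  console.log(rows.length) // 2
)
```

In a real extractor this would also need backpressure handling and a way to surface per-line parse errors instead of aborting the whole import.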
Committable suggestion skipped: line range outside the PR's diff.
carlbrugger left a comment
Nice work on this! 🙌🏼
This adds JSONL input file support to the JSON extractor plugin, extending the current functionality with the capability to parse `.jsonl` and `.jsonlines` file formats.