
Adding JSONL support to json extractor#705

Merged
carlbrugger merged 5 commits into FlatFilers:main from csibe17:feat/jsonl-support-v1
Nov 22, 2024
Conversation

@csibe17
Contributor

@csibe17 csibe17 commented Nov 20, 2024

This PR adds JSONL input file support to the json extractor plugin.
It extends the current functionality with the capability to parse the .jsonl and .jsonlines file formats.
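For context, a JSON Lines (JSONL) file stores one standalone JSON object per line. An illustrative sample (not the PR's actual test fixture):

```jsonl
{"First Name": "Ada", "Last Name": "Lovelace", "Email": "ada@example.com"}
{"First Name": "Grace", "Last Name": "Hopper", "Email": "grace@example.com"}
```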

@csibe17 csibe17 marked this pull request as ready for review November 20, 2024 23:16
@coderabbitai
Contributor

coderabbitai bot commented Nov 20, 2024

Walkthrough

The changes in this pull request enhance the JSON extraction functionality by introducing new JSON objects in the test-basic.jsonl file, representing individuals with detailed attributes. The JSONExtractor function has been modified to support multiple file extensions, including .jsonl and .jsonlines. The parseBuffer function has been updated to handle these new file types, and the test suite has been revised to improve clarity and include tests for JSONL input.
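As a rough sketch of the approach described above (not the PR's exact code), the JSONL handling amounts to validating each non-empty line as standalone JSON and joining the valid lines into a single JSON array string that the existing object parsing can consume:

```typescript
// Hypothetical sketch of the JSONL pre-processing step described in the
// walkthrough. Each non-empty line is validated as standalone JSON; valid
// lines are joined into one JSON array string for the existing parser.
function jsonlToJsonArray(fileContents: string): string {
  const lines = fileContents
    .split('\n')
    .filter((line) => line.trim() !== '')
    .filter((line) => {
      try {
        JSON.parse(line) // validate the line independently
        return true
      } catch {
        console.error('Invalid JSON line:', line)
        return false // drop malformed lines rather than failing the whole file
      }
    })
  return `[${lines.join(',')}]`
}

const sample = '{"name":"Ada"}\n{"name":"Grace"}\n'
console.log(JSON.parse(jsonlToJsonArray(sample)).length) // 2
```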

Changes

File Path | Change Summary
plugins/json-extractor/ref/test-basic.jsonl | Added three new JSON objects representing individuals with attributes such as name, email, address, and a multi-layered family structure.
plugins/json-extractor/src/index.ts | Updated JSONExtractor to accept multiple file extensions using a regex. Marked jsonExtractorPlugin as deprecated.
plugins/json-extractor/src/parser.spec.ts | Introduced an expectedSingleSheetCapture variable for expected output in tests. Updated tests to use this variable and added new test cases for JSONL input.
plugins/json-extractor/src/parser.ts | Modified parseBuffer to accept an optional options parameter for handling different file extensions and improved error handling for empty contents.
utils/extractor/src/index.ts | Updated the Extractor function to include a fileExt property in the options passed to parseBuffer, providing context for parsing different file types.

Possibly related PRs

  • feat: add multisheet extraction to JSON extractor #698: introduced multi-sheet extraction to the JSON extractor, which relates directly to this PR's new nested JSON test objects and expanded extraction capabilities.

Suggested reviewers

  • bangarang

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between eb71412 and 5a8bee8.

📒 Files selected for processing (2)
  • plugins/json-extractor/src/parser.spec.ts (1 hunks)
  • plugins/json-extractor/src/parser.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • plugins/json-extractor/src/parser.spec.ts
  • plugins/json-extractor/src/parser.ts


Contributor

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (6)
plugins/json-extractor/src/index.ts (1)

11-11: Document supported file extensions

Consider adding JSDoc comments to document the supported file extensions and their expected format requirements.

+/**
+ * Creates a JSON/JSONL extractor plugin
+ * @param options Plugin options
+ * @returns Extractor function
+ * @supports File extensions: .json (JSON), .jsonl/.jsonlines (JSON Lines)
+ */
 export const JSONExtractor = (options?: PluginOptions) => {
   return Extractor(/\.(jsonl?|jsonlines)$/i, 'json', parseBuffer, options)
 }
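The extension regex in the suggestion above can be sanity-checked directly (a quick illustrative snippet):

```typescript
// Quick check of the extension pattern used by JSONExtractor.
const pattern = /\.(jsonl?|jsonlines)$/i

console.log(pattern.test('data.json'))      // true
console.log(pattern.test('data.jsonl'))     // true
console.log(pattern.test('data.jsonlines')) // true
console.log(pattern.test('data.JSONL'))     // true (case-insensitive flag)
console.log(pattern.test('data.jsonx'))     // false
```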
plugins/json-extractor/src/parser.ts (1)

Line range hint 19-23: Improve error messages and type handling.

While the validation is good, consider these improvements:

  1. More descriptive error messages
  2. Proper object construction instead of type casting
     const parsedData = JSON.parse(fileContents)
     if (typeof parsedData !== 'object' || parsedData === null) {
-      console.error('Invalid input: data must be an object.')
-      return {} as WorkbookCapture
+      console.error('Invalid input: expected a JSON object or array, received:', typeof parsedData)
+      return { Sheet1: { headers: [], data: [] } }
     }
utils/extractor/src/index.ts (1)

82-82: LGTM! Consider improving type safety.

The addition of fileExt to the parseBuffer options is correct and aligns with the PR objectives to support JSONL files. However, the options: any type in the parseBuffer function signature could be improved for better type safety.

Consider updating the parseBuffer function signature to use a strongly typed options interface:

- parseBuffer: (buffer: Buffer, options: any) => WorkbookCapture | Promise<WorkbookCapture>,
+ interface ParserOptions {
+   fileExt?: string;
+   headerSelectionEnabled?: boolean;
+   [key: string]: any;  // For other optional properties
+ }
+ parseBuffer: (buffer: Buffer, options: ParserOptions) => WorkbookCapture | Promise<WorkbookCapture>,
plugins/json-extractor/src/parser.spec.ts (3)

8-158: Consider improving test data maintainability

While the test data structure is comprehensive, consider these improvements:

  1. Extract repeated structures (like Father hierarchy) into helper functions
  2. Vary the coordinate values across records to catch potential data mixing issues
  3. Add comments explaining the purpose of deep nesting levels

Example refactor for better maintainability:

const createFatherHierarchy = (level: number) => {
  const hierarchy: Record<string, {value: string}> = {};
  for (let i = 1; i <= level; i++) {
    const prefix = 'Father.'.repeat(i); // e.g. 'Father.Father.' at depth 2
    hierarchy[`${prefix}First Name`] = { value: `Father_First_${i}` };
    hierarchy[`${prefix}Last Name`] = { value: `Father_Last_${i}` };
  }
  return hierarchy;
};

// createAddressFields is assumed to flatten an address object the same way
const createTestPerson = (firstName: string, lastName: string, email: string, address: any) => ({
  'First Name': { value: firstName },
  'Last Name': { value: lastName },
  'Email': { value: email },
  ...createAddressFields(address),
  ...createFatherHierarchy(5)
});

160-167: Enhance test coverage with error scenarios

The happy path is well tested, but consider adding test cases for:

  1. Malformed JSON
  2. Empty file
  3. Invalid nested structures

Example additional test cases:

it('handles malformed json gracefully', () => {
  const buffer = Buffer.from('{invalid json}');
  expect(() => parseBuffer(buffer)).toThrow();
});

it('handles empty file gracefully', () => {
  const buffer = Buffer.from('');
  expect(() => parseBuffer(buffer)).toThrow();
});

Line range hint 1-178: Overall implementation looks good with room for improvement

The implementation successfully adds JSONL support while maintaining compatibility with existing JSON functionality. The test structure is clean and well-organized.

Consider these architectural improvements:

  1. Add a test utilities file for common test data creation
  2. Consider using test.each for testing multiple file formats
  3. Add performance tests for large JSONL files

Example test.each implementation:

test.each([
  ['json', '../ref/test-basic.json', undefined],
  ['jsonl', '../ref/test-basic.jsonl', 'jsonl']
])('has a single sheet from %s input', (format, filePath, fileExt) => {
  const buffer = fs.readFileSync(path.join(__dirname, filePath));
  const singleSheetCapture = parseBuffer(buffer, fileExt ? { fileExt } : undefined);
  expect(singleSheetCapture).toEqual(expectedSingleSheetCapture);
});
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 0eabe86 and df8b40a.

📒 Files selected for processing (5)
  • plugins/json-extractor/ref/test-basic.jsonl (1 hunks)
  • plugins/json-extractor/src/index.ts (1 hunks)
  • plugins/json-extractor/src/parser.spec.ts (1 hunks)
  • plugins/json-extractor/src/parser.ts (1 hunks)
  • utils/extractor/src/index.ts (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • plugins/json-extractor/ref/test-basic.jsonl
🔇 Additional comments (4)
plugins/json-extractor/src/index.ts (1)

11-11: Verify JSONL parsing behavior

The regex pattern correctly matches .json, .jsonl, and .jsonlines extensions, but we should verify that the parser correctly differentiates between regular JSON and JSONL formats.

✅ Verification successful

JSONL parsing implementation is correctly handled

The implementation properly differentiates between JSON and JSONL formats:

  • For JSONL files, it converts newline-separated JSON objects into a valid JSON array by adding brackets and commas
  • The parser correctly handles both formats by checking the file extension (fileExt === 'jsonl' || fileExt === 'jsonlines')
  • Test coverage exists with a dedicated JSONL test case using a real JSONL file
  • The expected output is identical for both JSON and JSONL inputs, ensuring format consistency
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if parseBuffer handles different file formats appropriately

# Search for parseBuffer implementation and format handling
ast-grep --pattern 'function parseBuffer($_) {
  $$$
}'

# Look for any test files that verify different formats
fd -e test.ts -e spec.ts -X rg -A 5 'jsonl|jsonlines'

Length of output: 875


Script:

#!/bin/bash
# Let's check the parseBuffer implementation and test file content

# Check parseBuffer implementation in parser.ts
ast-grep --pattern 'export function parseBuffer($_) {
  $$$
}'

# Check the test file content
cat plugins/json-extractor/src/parser.spec.ts

# Check if there's a test reference file
cat plugins/json-extractor/ref/test-basic.jsonl

Length of output: 10912


Script:

#!/bin/bash
# Let's check the parser.ts file to see how it handles different formats

# Check the parser implementation
cat plugins/json-extractor/src/parser.ts

# Also check if there are any other test files that might contain JSONL tests
fd -e test.ts -e spec.ts -X rg -l 'fileExt'

Length of output: 3017

plugins/json-extractor/src/parser.ts (2)

3-6: LGTM! Function signature change is well-structured.

The addition of the optional options parameter with fileExt property is well-typed and follows TypeScript best practices.


Line range hint 3-23: Verify test coverage and integration.

Please ensure:

  1. Test coverage for JSONL parsing edge cases
  2. Integration tests with the extractor system
✅ Verification successful

Let me check the test coverage and implementation details before making the final decision.


JSONL parsing implementation is properly tested and integrated

The verification shows:

  • JSONL parsing is tested with a comprehensive test file containing multiple records
  • Test file includes complex nested JSON structures, validating proper handling
  • Integration with the extractor system is confirmed through the test implementation using parseBuffer with fileExt option
  • The implementation correctly handles the conversion of JSONL to JSON array format
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check test coverage for JSONL handling
echo "Checking for JSONL test cases..."
rg -l "test.*jsonl" --type ts

# Check integration points
echo "Checking extractor integration..."
rg -l "parseBuffer.*fileExt" --type ts

# Look for existing JSONL files in tests
echo "Looking for JSONL test files..."
fd -e jsonl -e jsonlines

Length of output: 452


Script:

#!/bin/bash
# Check the test implementation for JSONL
echo "Checking JSONL test implementation..."
rg "test.*jsonl" -B 2 -A 10 plugins/json-extractor/src/parser.spec.ts

# Check the test file content
echo "Checking test file content..."
cat plugins/json-extractor/ref/test-basic.jsonl

# Check integration test implementation
echo "Checking integration test implementation..."
rg "parseBuffer.*fileExt" -B 2 -A 5 plugins/json-extractor/src/parser.spec.ts

Length of output: 2956

utils/extractor/src/index.ts (1)

Line range hint 9-17: Verify parseBuffer implementations for JSONL support.

Since the Extractor is a higher-order function that accepts parseBuffer implementations, we should ensure all implementations properly handle the new fileExt option.

Let's verify the parseBuffer implementations:

✅ Verification successful

Let me gather more information about the parseBuffer implementations to ensure we have a complete picture.


Let me do one final check to find all parseBuffer implementations across different plugins.


Based on the evidence gathered from all the shell script executions, I can now provide a final response:

JSONL support is properly implemented in the JSON extractor plugin

The parseBuffer implementation in the JSON extractor plugin correctly handles the fileExt option for both 'jsonl' and 'jsonlines' extensions, with proper transformation of JSONL format to JSON array. This is also verified by test coverage in parser.spec.ts.

  • ✓ Implementation in plugins/json-extractor/src/parser.ts handles both .jsonl and .jsonlines extensions
  • ✓ The Extractor's fileExt validation in utils/extractor/src/index.ts properly validates file extensions
  • ✓ Test coverage exists in plugins/json-extractor/src/parser.spec.ts
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find all parseBuffer implementations and verify JSONL support

# Find files that might contain parseBuffer implementations
echo "Searching for parseBuffer implementations..."
rg -l "parseBuffer.*=.*\(.*buffer.*,.*options" --type ts

# Check for JSONL file extension handling
echo "Checking for JSONL extension handling..."
rg -A 5 "fileExt.*['\"]jsonl" --type ts
rg -A 5 "fileExt.*['\"]jsonlines" --type ts

Length of output: 1761


Script:

#!/bin/bash
# Let's find all parseBuffer implementations and their handling of options
ast-grep --pattern 'const parseBuffer = ($_, $_) => {
  $$$
}'

# Also check for any other file extension handling in parsers
rg -A 5 "fileExt.*['\"]" --type ts

Length of output: 1942


Script:

#!/bin/bash
# Find all files in plugins directory that might contain parseBuffer implementations
fd . plugins -t f -e ts -x rg -l "parseBuffer.*=.*function|parseBuffer.*=.*\(.*\).*=>"

# Check the implementation details of parseBuffer in these files
fd . plugins -t f -e ts -x rg -A 10 "parseBuffer.*=.*function|parseBuffer.*=.*\(.*\).*=>"

Length of output: 180

@carlbrugger carlbrugger force-pushed the feat/jsonl-support-v1 branch from df8b40a to 380383f on November 22, 2024 at 17:00
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
.changeset/tender-lizards-report.md (1)

1-6: Enhance the release notes with more details.

The version bumps are appropriate for this feature addition. However, consider expanding the release notes to be more informative for users by:

  • Mentioning the specific file extensions supported (.jsonl and .jsonlines)
  • Adding a brief description of what JSONL (JSON Lines) format is
  • Including any relevant usage examples or documentation links

Here's a suggested enhancement:

---
'@flatfile/plugin-json-extractor': minor
'@flatfile/util-extractor': minor
---

-This release adds support for JSONL files in the JSON Extractor plugin.
+This release adds support for JSONL (JSON Lines) files in the JSON Extractor plugin. 
+Files with extensions `.jsonl` and `.jsonlines` are now supported. JSONL format 
+allows storing structured data as multiple JSON objects, one per line, making it 
+ideal for streaming and large datasets.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between df8b40a and 84097a0.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json, !**/*.json
📒 Files selected for processing (6)
  • .changeset/tender-lizards-report.md (1 hunks)
  • plugins/json-extractor/ref/test-basic.jsonl (1 hunks)
  • plugins/json-extractor/src/index.ts (1 hunks)
  • plugins/json-extractor/src/parser.spec.ts (1 hunks)
  • plugins/json-extractor/src/parser.ts (1 hunks)
  • utils/extractor/src/index.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • plugins/json-extractor/ref/test-basic.jsonl
  • plugins/json-extractor/src/index.ts
  • plugins/json-extractor/src/parser.spec.ts
  • plugins/json-extractor/src/parser.ts
  • utils/extractor/src/index.ts

@carlbrugger carlbrugger force-pushed the feat/jsonl-support-v1 branch from eb71412 to 5a8bee8 on November 22, 2024 at 17:05
Contributor

@coderabbitai coderabbitai bot left a comment

Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
plugins/json-extractor/src/parser.ts (1)

3-6: LGTM! Consider adding JSDoc documentation.

The function signature change is well-structured and maintains backward compatibility. The readonly modifier on the options type provides good immutability guarantees.

Consider adding JSDoc documentation to describe the new parameter and supported file extensions:

+/**
+ * Parses a buffer containing JSON or JSONL data into a WorkbookCapture
+ * @param buffer - The input buffer containing the file contents
+ * @param options - Optional parsing configuration
+ * @param options.fileExt - File extension to determine parsing strategy ('jsonl'|'jsonlines')
+ * @returns A WorkbookCapture containing the parsed data
+ */
 export function parseBuffer(
   buffer: Buffer,
   options?: { readonly fileExt?: string }
 ): WorkbookCapture {
🛑 Comments failed to post (1)
plugins/json-extractor/src/parser.ts (1)

15-30: ⚠️ Potential issue

Consider memory and security implications.

While the JSONL parsing implementation is functional, there are several concerns to address:

  1. Memory Usage: Loading the entire file into memory could be problematic for large files
  2. Security: No limit on number of lines could lead to DOS attacks
  3. Performance: Multiple array iterations could be optimized

Consider implementing streaming for large files and adding safety limits:

 if (options?.fileExt === 'jsonl' || options?.fileExt === 'jsonlines') {
+  const MAX_LINES = 1000000; // Configure based on your requirements
+  let lineCount = 0;
   const lines = fileContents
     .split('\n')
     .filter((line) => line.trim() !== '')
     .map((line) => {
+      if (++lineCount > MAX_LINES) {
+        throw new Error(`File exceeds maximum limit of ${MAX_LINES} lines`);
+      }
       try {
         JSON.parse(line)
         return line
       } catch (e) {
-        console.error('Invalid JSON line:', line)
+        console.error(`Invalid JSON at line ${lineCount}:`, e.message)
         return null
       }
     })
     .filter((line) => line !== null)
   fileContents = `[${lines.join(',')}]`
 }

For better scalability, consider implementing a streaming parser for large files:

  1. Use Node.js readline interface or a streaming JSON parser
  2. Process lines incrementally instead of loading entire file
  3. Implement backpressure handling

Would you like me to provide an example implementation using streams?
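As an illustration of what such a streaming variant might look like (a hypothetical sketch using Node's readline, not part of this PR):

```typescript
// Hypothetical streaming JSONL reader: processes one line at a time via
// Node's readline instead of buffering the whole file in memory.
import { createInterface } from 'node:readline'
import { Readable } from 'node:stream'

async function* parseJsonlStream(input: Readable): AsyncGenerator<unknown> {
  const rl = createInterface({ input, crlfDelay: Infinity })
  for await (const line of rl) {
    if (line.trim() === '') continue
    try {
      yield JSON.parse(line) // emit each record as soon as it is read
    } catch {
      console.error('Skipping invalid JSON line:', line)
    }
  }
}

// Usage: collect records from an in-memory stream
async function main() {
  const records: unknown[] = []
  for await (const rec of parseJsonlStream(Readable.from('{"a":1}\nbad\n{"a":2}\n'))) {
    records.push(rec)
  }
  console.log(records.length) // 2
}
main()
```

Because records are yielded incrementally, a caller can apply backpressure naturally by simply not pulling the next record until it is ready.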

Committable suggestion skipped: line range outside the PR's diff.

Contributor

@carlbrugger carlbrugger left a comment

Nice work on this! 🙌🏼

@carlbrugger carlbrugger merged commit 7ef4cd3 into FlatFilers:main Nov 22, 2024