feat: add multisheet extraction to JSON extractor#698

Merged
carlbrugger merged 2 commits into main from feat/multisheet-json-extractor-option
Nov 7, 2024

Conversation

@carlbrugger
Contributor

Please explain how to summarize this PR for the Changelog:

This PR introduces multi-sheet extraction to the JSON extractor plugin.

Tell code reviewer how and what to test:

Run the listener in the sandbox, and upload the test files to a Space.

@carlbrugger carlbrugger marked this pull request as ready for review November 7, 2024 16:11
@carlbrugger carlbrugger requested a review from bangarang November 7, 2024 16:15
@coderabbitai
Contributor

coderabbitai bot commented Nov 7, 2024

Walkthrough

This pull request introduces multi-sheet extraction functionality to the JSON extractor plugin, allowing users to extract data from multiple sheets in a single operation. It includes updates to various components, such as the addition of a new interface for options in the JSONExtractor, modifications to the listener to incorporate this extractor, and enhancements to the parsing logic and tests to support multi-sheet structures.
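As an illustration of the two input shapes involved (the sheet names and record fields below are hypothetical, not taken from the plugin's fixtures), a single-sheet file is a top-level JSON array, while a multi-sheet file is a top-level object whose keys become sheet names:

```typescript
// Hypothetical illustration of the two JSON shapes. A single-sheet file
// is a top-level array of records:
const singleSheet = [
  { 'First Name': 'Ada', 'Last Name': 'Lovelace' },
]

// A multi-sheet file is a top-level object whose keys become sheet names
// and whose values are arrays of records:
const multiSheet = {
  people: [{ 'First Name': 'Ada' }],
  orders: [{ id: 1, total: 9.99 }],
}

// A simple discriminator mirroring this convention: wrap a bare array
// under a default "Sheet1" key, pass an object through unchanged.
const toSheets = (input: unknown): Record<string, unknown[]> =>
  Array.isArray(input)
    ? { Sheet1: input }
    : (input as Record<string, unknown[]>)
```

The "Sheet1" default mirrors the parser behavior discussed in the review below; the helper itself is illustrative only.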

Changes

  • .changeset/slow-bottles-roll.md: Updated the @flatfile/plugin-json-extractor dependency with a minor version bump.
  • flatfilers/sandbox/src/index.ts: Added an import for JSONExtractor and modified the listener to use JSONExtractor().
  • plugins/json-extractor/src/index.ts: Introduced the PluginOptions interface, updated the JSONExtractor method signature, and marked jsonExtractorPlugin as deprecated.
  • plugins/json-extractor/src/parser.spec.ts: Enhanced the test suite for the JSON parser with single- and multi-sheet parsing tests.
  • plugins/json-extractor/src/parser.ts: Modified the parseBuffer and parseSheet functions for improved error handling and data validation.

Possibly related PRs

Suggested reviewers

  • bangarang

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (10)
.changeset/slow-bottles-roll.md (1)

1-5: Enhance the changelog description with more details.

While the version bump is appropriate for this feature addition, the changelog description could be more informative for users. Consider expanding it to include:

  • Details about the new PluginOptions interface
  • Changes to the JSONExtractor signature
  • Mention of the deprecated jsonExtractorPlugin

Here's a suggested enhancement:

---
'@flatfile/plugin-json-extractor': minor
---

-This release introduces multi-sheet extraction to the JSON extractor plugin.
+This release introduces multi-sheet extraction to the JSON extractor plugin. The update includes:
+
+- Support for extracting data from multiple sheets in a single operation
+- New `PluginOptions` interface for configuring extraction parameters
+- Updated `JSONExtractor` function signature to support the new options
+- Deprecation of `jsonExtractorPlugin` in favor of `JSONExtractor`
plugins/json-extractor/src/index.ts (2)

4-8: Add JSDoc comments to document the PluginOptions interface.

While the interface is well-structured, adding documentation would help users understand the purpose and impact of each option.

+/**
+ * Configuration options for the JSON extractor plugin
+ */
 export interface PluginOptions {
+  /** Maximum size of chunks when processing large JSON files */
   chunkSize?: number
+  /** Number of parallel processing threads */
   parallel?: number
+  /** Enable debug logging */
   debug?: boolean
 }

Line range hint 17-19: Enhance the deprecation notice with migration instructions.

While the deprecation notice is correct, it would be more helpful to provide specific migration instructions.

 /*
- * @deprecated use `JSONExtractor` instead
+ * @deprecated Please use `JSONExtractor` instead.
+ * @example
+ * // Before
+ * import { jsonExtractorPlugin } from '@flatfile/plugin-json-extractor'
+ * 
+ * // After
+ * import { JSONExtractor } from '@flatfile/plugin-json-extractor'
  */
flatfilers/sandbox/src/index.ts (1)

7-7: Consider adding options for better testing capabilities

Since this is a sandbox environment and the PR introduces multi-sheet extraction, it would be beneficial to enable debug mode and configure relevant options. This will help in testing and verifying the new functionality.

Consider updating the configuration:

- listener.use(JSONExtractor())
+ listener.use(JSONExtractor({
+   debug: true,
+   parallel: 2,     // concurrency level; `PluginOptions` types this as a number
+   chunkSize: 1000  // adjust based on testing needs
+ }))
plugins/json-extractor/src/parser.spec.ts (2)

7-9: Consider using mock data without PII.

The test file contains personal information (names, emails, addresses). Consider using clearly fake data (e.g., "user1@example.com") to avoid any potential PII concerns.
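One lightweight way to keep fixtures obviously fake is to derive every field from an index (a sketch; the helper name is invented):

```typescript
// Hypothetical fixture helper: every value is derived from an index,
// so nothing in the test data can be mistaken for real PII.
const fakeUser = (n: number) => ({
  'First Name': `TestFirst${n}`,
  'Last Name': `TestLast${n}`,
  Email: `user${n}@example.com`,
})

const fixtures = [1, 2, 3].map(fakeUser)
```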


37-158: Refactor test data for better maintainability.

The test data has several issues that could be improved:

  1. Coordinates are duplicated across all records (40.7128° N, 74.0060° W)
  2. The Father object is deeply nested (5 levels) with repetitive patterns

Consider:

  1. Using constants for repeated values
  2. Reducing nesting depth in test data
  3. Using a test data factory function

Example refactor:

const TEST_COORDINATES = {
  latitude: '40.7128° N',
  longitude: '74.0060° W'
};

const createTestPerson = (firstName: string, lastName: string, email: string) => ({
  'First Name': { value: firstName },
  'Last Name': { value: lastName },
  Email: { value: email },
  // ... other fields
  'Address.Coordinates.Latitude': { value: TEST_COORDINATES.latitude },
  'Address.Coordinates.Longitude': { value: TEST_COORDINATES.longitude }
});
plugins/json-extractor/src/parser.ts (4)

23-31: Log a warning when sheet data is not an array to improve debugging.

In the current implementation, if a sheet's value is not an array, it is silently skipped. This could make debugging difficult if the input JSON does not match the expected format.

Consider adding a warning message to inform users when a sheet is ignored due to invalid data:

        for (const [sheet, value] of Object.entries(sheets)) {
          if (Array.isArray(value)) {
            try {
              results[sheet] = parseSheet(value)
            } catch (error) {
              console.error(`Error processing sheet "${sheet}":`, error)
            }
          }
+         else {
+           console.warn(`Sheet "${sheet}" is not an array and has been skipped.`)
+         }
        }

Line range hint 40-85: Enhance type safety by specifying interfaces for input data.

The functions parseBuffer and parseSheet use any types for input parameters and intermediate variables, which can lead to runtime errors and makes the code harder to maintain.

Define interfaces to represent the expected structure of the JSON data:

interface JsonObject {
  [key: string]: any;
}

interface FlattenedRow {
  [header: string]: { value: any };
}

Then update the function signatures and variables:

-export function parseSheet(jsonArray: any[]): SheetCapture {
+export function parseSheet(jsonArray: JsonObject[]): SheetCapture {

...

-        const flattenObject = (obj: any, parent: string = '', res: any = {}) => {
+        const flattenObject = (obj: JsonObject, parent: string = '', res: any = {}) => {

...

-        const filteredData = filteredResults.map((row) => {
+        const filteredData = filteredResults.map((row): FlattenedRow => {

This improvement will help catch type-related errors during development and enhance code readability.


72-77: Optimize data mapping by eliminating redundant computation.

In the data mapping logic, the headers are processed for each row, which can be inefficient for large datasets.

Consider restructuring the data mapping to enhance performance:

        const flattenedData = filteredResults.map((row) => flattenObject(row))
        const filteredData = flattenedData.map((flattenedRow) => {
          const filteredRow = {}
          for (const header of headers) {
            const cell = flattenedRow[header]
            filteredRow[header] = {
              value: Array.isArray(cell) ? JSON.stringify(cell) : cell,
            }
          }
          return filteredRow
        })

This change separates the flattening of rows from the filtering by headers, reducing redundant computations.


1-38: Ensure consistent error handling and logging in parseBuffer.

While errors during JSON parsing are caught and re-thrown, other errors (like invalid input types) return an empty WorkbookCapture. This inconsistency might lead to confusion.

Consider standardizing error handling by either always throwing errors or always returning a consistent error object. This will make the function's behavior more predictable.
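One way to make the behavior uniform, sketched below under the assumption that throwing is the desired failure mode (the function name and error messages are hypothetical, not the plugin's actual exports):

```typescript
// Hypothetical sketch: make both failure paths throw, instead of one
// throwing (bad JSON) and the other silently returning an empty capture
// (unexpected top-level type).
type WorkbookCaptureSketch = Record<string, unknown[]>

function parseBufferSketch(buffer: Buffer): WorkbookCaptureSketch {
  let parsed: unknown
  try {
    parsed = JSON.parse(buffer.toString('utf8'))
  } catch (err) {
    throw new Error(`Invalid JSON: ${(err as Error).message}`)
  }
  // A bare array is treated as a single sheet under a default key.
  if (Array.isArray(parsed)) return { Sheet1: parsed }
  if (parsed !== null && typeof parsed === 'object') {
    return parsed as WorkbookCaptureSketch
  }
  // Previously this case would return an empty capture; throwing keeps
  // the failure mode consistent with the JSON-parse path above.
  throw new Error('Expected a JSON array or object at the top level')
}
```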

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a85f63a and 7f0168e.

⛔ Files ignored due to path filters (2)
  • package-lock.json is excluded by !**/package-lock.json, !**/*.json
  • plugins/json-extractor/ref/test-multisheet.json is excluded by !**/*.json
📒 Files selected for processing (5)
  • .changeset/slow-bottles-roll.md (1 hunks)
  • flatfilers/sandbox/src/index.ts (1 hunks)
  • plugins/json-extractor/src/index.ts (1 hunks)
  • plugins/json-extractor/src/parser.spec.ts (1 hunks)
  • plugins/json-extractor/src/parser.ts (3 hunks)
🔇 Additional comments (7)
plugins/json-extractor/src/index.ts (2)

10-12: LGTM! Clean and type-safe implementation.

The function signature update using the new PluginOptions interface improves type safety while maintaining a clean implementation.


1-3: Verify multi-sheet extraction implementation in parseBuffer.

The PR objectives mention adding multi-sheet extraction capability, but the implementation details are not visible in this file. Let's verify the implementation in the parser module.

✅ Verification successful

Let me gather more information about the implementation by checking the test file and the parser implementation.


Multi-sheet extraction is properly implemented

The implementation in parseBuffer and the test cases confirm proper multi-sheet support:

  • The parser handles both single-sheet (array input) and multi-sheet (object input) JSON files
  • For single sheets, it automatically creates a "Sheet1" key
  • For multiple sheets, it processes each key-value pair as separate sheets
  • Test cases verify both scenarios with test-basic.json and test-multisheet.json
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the parseBuffer implementation for multi-sheet support

# Look for the parseBuffer implementation
echo "Checking parseBuffer implementation:"
ast-grep --pattern 'function parseBuffer' -A 10

# Look for multi-sheet related tests
echo -e "\nChecking related test cases:"
rg -l "multi.*sheet|sheet.*extraction" --type ts

Length of output: 298


Script:

#!/bin/bash
# Check the content of parser.spec.ts for multi-sheet test cases
echo "Content of parser.spec.ts:"
cat plugins/json-extractor/src/parser.spec.ts

# Look for the parser implementation
echo -e "\nChecking parser implementation:"
cat plugins/json-extractor/src/parser.ts

Length of output: 11682

flatfilers/sandbox/src/index.ts (2)

3-3: LGTM: Import statement is correctly placed

The import of JSONExtractor follows the established pattern of importing Flatfile plugins.


7-7: Verify JSONExtractor compatibility with existing export actions

The sandbox environment includes multiple sheets with export actions. Please verify that the JSONExtractor works harmoniously with the existing export functionality, particularly the export-delimited-zip and export-external-api actions.

✅ Verification successful

Let me gather more information about the JSONExtractor implementation and its interaction with the export plugins.


Based on the gathered information, I can now provide the final response:

JSONExtractor is compatible with existing export actions

The implementation shows that:

  • JSONExtractor is properly integrated as a plugin and doesn't interfere with export operations
  • Export handlers (export-delimited-zip and export-external-api) operate independently
  • File handlers explicitly skip processing export files with if (file.mode === 'export') { return }
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential conflicts between JSONExtractor and existing export actions

# Search for export-related action handlers that might interact with JSONExtractor
rg -A 5 "export-delimited-zip|export-external-api" --type ts

# Look for any existing JSON extraction implementations that might conflict
ast-grep --pattern 'export function $_($$_) {
  $$$
  JSON.$_($$$)
  $$$
}'

Length of output: 1814


Script:

#!/bin/bash
# Check how JSONExtractor is used in the codebase and its interaction with other plugins
rg -A 5 "JSONExtractor" --type ts

# Look for any event handlers or hooks that might process the data before export
ast-grep --pattern 'listener.on($_, async ($$$) => {
  $$$
})'

# Check for any data transformation or processing before export
ast-grep --pattern 'export const $_ = {
  $$$
  transform: $_($$$) {
    $$$
  }
  $$$
}'

Length of output: 11237

plugins/json-extractor/src/parser.spec.ts (2)

1-4: Well-structured test organization!

The test file is properly organized with clear imports and a logical separation of test cases into distinct describe blocks.


3-3: Verify test coverage for parseSheet function.

The parseSheet function is imported but not directly tested. Let's verify its usage and test coverage.

✅ Verification successful

Let me gather more information about the test coverage by checking the actual test file content.


Let me check the implementation of parseSheet to understand how it's used within parseBuffer.


Test coverage for parseSheet function is sufficient through parseBuffer tests

The parseSheet function is indirectly but thoroughly tested through the parseBuffer tests in parser.spec.ts. The test file includes comprehensive test cases that exercise parseSheet's functionality:

  • Single sheet parsing with nested JSON structures
  • Multi-sheet parsing with different data structures
  • Edge cases with various data types and nested levels
  • Full validation of the flattened structure and header generation

The parseSheet function is an internal implementation detail that's properly encapsulated and tested through its parent function parseBuffer, which is the main public API. The existing test coverage adequately verifies both the functionality and edge cases of parseSheet.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check test coverage and usage of parseSheet function

# Find the implementation of parseSheet
echo "Searching for parseSheet implementation..."
ast-grep --pattern 'export function parseSheet($_) {
  $$$
}'

# Find all uses of parseSheet in tests
echo "Searching for parseSheet usage in tests..."
rg 'parseSheet\(' --type ts

Length of output: 463


Script:

#!/bin/bash
# Check the content of the test file
echo "Content of parser.spec.ts:"
cat plugins/json-extractor/src/parser.spec.ts

# Also check if there are any indirect tests through parseBuffer
echo -e "\nSearching for parseBuffer tests that might cover parseSheet:"
rg 'parseBuffer\(' -A 5 --type ts

Length of output: 20777


Script:

#!/bin/bash
# Get the implementation of parseSheet and its context
echo "Content of parser.ts:"
cat plugins/json-extractor/src/parser.ts

Length of output: 2768

plugins/json-extractor/src/parser.ts (1)

43-48: Handle empty or invalid JSON arrays gracefully.

If the input JSON array is empty or contains no valid objects after filtering, the function returns an empty SheetCapture. Depending on downstream processing, this might lead to unexpected behavior.

Ensure that the calling code can handle empty sheets appropriately. You might also consider returning an error or a more informative message.
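A minimal sketch of such a guard, assuming a SheetCapture shape with `headers` and `data` fields (the helper name is invented, not part of the plugin's API):

```typescript
// Hypothetical guard so callers can detect an empty sheet explicitly
// rather than relying on downstream behavior.
interface SheetCaptureSketch {
  headers: string[]
  data: Record<string, { value: unknown }>[]
}

function isEmptySheet(capture: SheetCaptureSketch): boolean {
  return capture.headers.length === 0 || capture.data.length === 0
}
```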

 const flattenObject = (obj: any, parent: string = '', res: any = {}) => {
   for (let key in obj) {
-    const propName = parent ? parent + '.' + key : key
+    const propName = parent ? `${parent}.${key}` : key

⚠️ Potential issue

Prevent runtime errors by handling null values in flattenObject.

Currently, if an object property is null, the function flattenObject will attempt to iterate over it, leading to a runtime error. This is because typeof null returns 'object', causing the recursion to be invoked on null.

Apply this diff to handle null values appropriately:

-            if (typeof obj[key] === 'object') {
+            if (typeof obj[key] === 'object' && obj[key] !== null) {

Committable suggestion skipped: line range outside the PR's diff.
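The pitfall and the guard can be demonstrated with a standalone sketch of the flattening logic (simplified from the plugin's code):

```typescript
// Minimal sketch of the null pitfall: typeof null === 'object', so an
// unguarded recursion would try to enumerate null's keys. The explicit
// null check routes null values to the leaf branch instead.
const flattenObject = (
  obj: Record<string, unknown>,
  parent = '',
  res: Record<string, unknown> = {}
): Record<string, unknown> => {
  for (const key in obj) {
    const propName = parent ? `${parent}.${key}` : key
    const value = obj[key]
    if (typeof value === 'object' && value !== null) {
      flattenObject(value as Record<string, unknown>, propName, res)
    } else {
      res[propName] = value // null lands here instead of recursing
    }
  }
  return res
}
```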

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
plugins/json-extractor/src/parser.spec.ts (2)

7-166: Add error handling test cases for single sheet parsing.

While the happy path is well tested, consider adding test cases for error scenarios:

  1. Invalid JSON structure
  2. Missing required fields
  3. Malformed nested structures

Example test case:

it('handles invalid JSON structure', () => {
  const invalidBuffer = Buffer.from('{"headers": [], "data": [{"incomplete": }]}');
  expect(() => parseBuffer(invalidBuffer)).toThrow();
});

167-238: Add sheet-specific validation tests.

While the basic multi-sheet functionality is tested, consider adding tests for sheet-specific scenarios:

  1. Sheet name collisions
  2. Cross-sheet references or dependencies
  3. Sheets with different character encodings
  4. Maximum sheet name length validation

Example test case:

it('handles sheet name collisions', () => {
  // Note: duplicate keys in a JavaScript object literal silently collapse
  // to the last value, so a raw JSON string is needed to exercise this
  // case. JSON.parse also keeps only the last occurrence, so the parser
  // would have to detect the duplicate during parsing to throw here.
  const duplicateSheetJson = '{"sheet1": [{"A": 1}], "sheet1": [{"A": 2}]}';
  expect(() => parseBuffer(Buffer.from(duplicateSheetJson))).toThrow(/duplicate sheet name/i);
});
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7f0168e and ef871e1.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json, !**/*.json
📒 Files selected for processing (1)
  • plugins/json-extractor/src/parser.spec.ts (1 hunks)
🔇 Additional comments (1)
plugins/json-extractor/src/parser.spec.ts (1)

4-4: LGTM: Import statement correctly added.

The new parseSheet function is properly imported from the parser module.

@carlbrugger carlbrugger merged commit 25836cc into main Nov 7, 2024
@carlbrugger carlbrugger deleted the feat/multisheet-json-extractor-option branch November 7, 2024 20:30