
Adds a new JSON multi-sheet extractor plugin#697

Closed
amarbakir wants to merge 5 commits into FlatFilers:main from aprimetechnology:main

Conversation


@amarbakir amarbakir commented Nov 5, 2024

Please explain how to summarize this PR for the Changelog:

  • Initialize JSON multi-sheet extractor plugin
  • Refactor out shared code from plugins/json-extractor into utils/extractor
  • Add utils/extractor tests
  • Simplify JSON extractor/multisheet extractor test cases based on core extraction logic now living in utils/extractor
  • Add example of Array and primitive (int) parsing to unit test

Tell code reviewer how and what to test:

Test the plugins/json-extractor and plugins/json-multisheet-extractor plugins. Existing behavior should not change.

Requirements:

  • Must allow users to upload a single JSON file which results in data correctly uploading to the corresponding sheets
  • Should be able to determine sheet-to-data relationships from top-level JSON keys, e.g.:
{
  "contacts": [ // sheet 1
    { // record 1
      "firstName": "John", // header 1, value
      "lastName": "Doe", // header 2, value
      "email": "john.doe@example.com"
    },
    { // record 2
      "firstName": "Jane", // header 1, value
      "lastName": "Smith",
      "email": "jane.smith@example.com"
    }
  ],
  "orders": [ // sheet 2
    { // record 1
      "id": "ORD123456", // header 1, value
      "amount": 250 // header 2, value
    }
  ]
}
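To make the requirement concrete, here is a minimal, self-contained sketch of how top-level keys could be split into sheets. The names (`SheetCapture`, `splitIntoSheets`) are illustrative only and are not taken from the plugin's actual source:

```typescript
// Illustrative sketch only -- not the plugin's real implementation.
// Each top-level key whose value is an array of objects becomes a sheet;
// the union of record keys becomes that sheet's headers.
type SheetCapture = { headers: string[]; data: Record<string, { value: unknown }>[] }

function splitIntoSheets(parsed: Record<string, unknown>): Record<string, SheetCapture> {
  const sheets: Record<string, SheetCapture> = {}
  for (const [sheetName, rows] of Object.entries(parsed)) {
    if (!Array.isArray(rows)) continue // non-array top-level keys are skipped
    const headers = Array.from(
      new Set(rows.flatMap((row) => Object.keys(row as object)))
    )
    const data = rows.map((row) =>
      Object.fromEntries(
        headers
          .filter((h) => h in (row as object))
          .map((h) => [h, { value: (row as Record<string, unknown>)[h] }])
      )
    )
    sheets[sheetName] = { headers, data }
  }
  return sheets
}
```

Under this sketch, the example payload above would yield a `contacts` sheet with headers `firstName`, `lastName`, `email` and an `orders` sheet with headers `id`, `amount`.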

The resulting UI partially supports this. A nice next step would be to allow a multi-sheet import in one step through the UI, or at the very least not require going back to the files list to re-import the second sheet (load data from sheet contacts -> import data to contacts, and load data from sheet orders -> import data to orders, all in one step):
[Screenshot 2024-10-30 at 6:32:43 PM]

As part of this effort, we surfaced this issue that has now been resolved: #695

Contributor

coderabbitai bot commented Nov 5, 2024

Walkthrough

This pull request introduces updates to two Flatfile plugins: @flatfile/plugin-json-extractor, which has received a minor version update, and @flatfile/plugin-json-multisheet-extractor, which has undergone a patch version update. The changes also include the initialization of the JSON multi-sheet extractor plugin and significant modifications to the parseBuffer function in the JSON extractor, streamlining data processing. Additionally, new test suites and documentation for the multi-sheet extractor plugin have been added.

Changes

| File Path | Change Summary |
| --- | --- |
| `.changeset/slow-bottles-roll.md` | Updated versions for `@flatfile/plugin-json-extractor` (minor) and `@flatfile/plugin-json-multisheet-extractor` (patch). Initialized JSON multi-sheet extractor plugin. |
| `plugins/json-extractor/src/parser.spec.ts` | Removed "Father" entries from expected output and headers in test cases for the `parseBuffer` function. |
| `plugins/json-extractor/src/parser.ts` | Replaced filtering and flattening logic in `parseBuffer` with a `parseSheet` call. Updated the condition for returning an empty `WorkbookCapture` and introduced an `isEmpty` function. |
| `plugins/json-multisheet-extractor/CHANGELOG.md` | Noted initial release of `@flatfile/plugin-json-multisheet-extractor` version 0.1.0. |
| `plugins/json-multisheet-extractor/README.md` | Added README for `@flatfile/json-multisheet-extractor`, detailing functionality, parameters, API calls, and usage examples. |
| `plugins/json-multisheet-extractor/jest.config.cjs` | Introduced Jest configuration for a Node.js testing environment with TypeScript support. |
| `plugins/json-multisheet-extractor/src/index.ts` | Added `JSONMultiSheetExtractor` function and `jsonParser` export, utilizing `parseBuffer`. |
| `plugins/json-multisheet-extractor/src/parser.spec.ts` | Introduced a test suite for `parseBuffer`, verifying output structure for contacts and orders. |
| `plugins/json-multisheet-extractor/src/parser.ts` | Added `parseBuffer` function to process JSON data into `WorkbookCapture` format with error handling. |
| `plugins/json-multisheet-extractor/tsup.config.mjs` | Added configuration file for the tsup build process. |
| `utils/extractor/src/index.spec.ts` | Introduced a test suite for the `parseSheet` function, validating conversion of JSON to structured format. |
| `utils/extractor/src/index.ts` | Added `parseSheet` function to process JSON objects into `SheetCapture` format with error handling and nested property flattening. |
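The walkthrough above mentions that `parseSheet` performs nested property flattening (visible in the test data as dot-delimited headers like `Father.First Name`). A minimal sketch of that kind of flattening, not the actual `util-extractor` code:

```typescript
// Illustrative sketch only -- not the actual @flatfile/util-extractor code.
// Nested objects collapse into dot-delimited headers, e.g. "Address.City";
// primitives and arrays are treated as leaf cells.
function flattenRecord(
  obj: Record<string, unknown>,
  prefix = ''
): Record<string, unknown> {
  const flat: Record<string, unknown> = {}
  for (const [key, value] of Object.entries(obj)) {
    const header = prefix ? `${prefix}.${key}` : key
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      // Recurse into nested objects, extending the header prefix
      Object.assign(flat, flattenRecord(value as Record<string, unknown>, header))
    } else {
      flat[header] = value
    }
  }
  return flat
}
```

For example, `{ Address: { Coordinates: { Latitude: '40.7128° N' } } }` flattens to a single `Address.Coordinates.Latitude` key, matching the shape the new test suite asserts on.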

Possibly related PRs

Suggested reviewers

  • bangarang



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🧹 Outside diff range and nitpick comments (18)
plugins/json-multisheet-extractor/src/index.ts (2)

4-10: Add JSDoc documentation and improve type safety.

Consider adding documentation and type improvements:

  1. Add JSDoc to describe the purpose and parameters
  2. Consider creating an interface for the options object
  3. Add explicit return type
+/**
+ * Creates a JSON multi-sheet extractor that processes JSON files into multiple sheets
+ * @param options Configuration options
+ * @param options.chunkSize Number of records to process in each chunk
+ * @param options.parallel Number of parallel processing threads
+ * @param options.debug Enable debug logging
+ * @returns Configured extractor instance
+ */
+interface ExtractorOptions {
+  chunkSize?: number;
+  parallel?: number;
+  debug?: boolean;
+}
-export const JSONMultiSheetExtractor = (options?: {
-  chunkSize?: number
-  parallel?: number
-  debug?: boolean
-}) => {
+export const JSONMultiSheetExtractor = (options?: ExtractorOptions): ReturnType<typeof Extractor> => {

12-12: Add documentation for the jsonParser export.

Consider adding JSDoc to explain why this function is exported separately and its multi-sheet parsing capabilities.

+/**
+ * Direct access to the multi-sheet JSON parsing functionality
+ * @see parseBuffer
+ */
 export const jsonParser = parseBuffer
plugins/json-multisheet-extractor/src/parser.ts (2)

2-3: Remove extra empty line.

A single empty line after imports is sufficient for readability.

 import { WorkbookCapture, parseSheet } from '@flatfile/util-extractor'

-
 export function parseBuffer(buffer: Buffer): WorkbookCapture {

38-41: Enhance error context for debugging.

The error handling should provide more context about where the failure occurred.

   } catch (error) {
-    console.error('An error occurred:', error)
-    throw error
+    const wrappedError = new Error(
+      `Failed to parse JSON multi-sheet data: ${error.message}`
+    )
+    wrappedError.cause = error
+    throw wrappedError
   }
plugins/json-multisheet-extractor/src/parser.spec.ts (3)

6-8: Consider using a more robust way to handle test data paths.

The hardcoded path to the test data file could break if the directory structure changes. Consider using environment variables or a test configuration file to manage test data paths.

-  const buffer: Buffer = fs.readFileSync(
-    path.join(__dirname, '../ref/test-multisheet.json')
-  )
+  const TEST_DATA_DIR = process.env.TEST_DATA_DIR || path.join(__dirname, '../ref')
+  const buffer: Buffer = fs.readFileSync(
+    path.join(TEST_DATA_DIR, 'test-multisheet.json')
+  )

10-25: Add type definitions for header constants.

Consider adding type definitions to improve maintainability and ensure consistency with the actual implementation.

+type HeaderType = readonly string[]
+
-  const CONTACT_HEADERS = [
+  const CONTACT_HEADERS: HeaderType = [
     'First Name',
     // ...
   ];

-  const ORDER_HEADERS = [
+  const ORDER_HEADERS: HeaderType = [
     'ID',
     'Amount',
   ]

85-92: Consider removing redundant header tests.

These header validation tests are redundant as they're already covered in the main 'JSON to WorkbookCapture' test. Consider removing them or testing specific header-related edge cases instead.

If you want to keep header-specific tests, consider testing edge cases like:

test('handles headers with special characters', () => {
  // Test with headers containing spaces, dots, etc.
})

test('maintains header order', () => {
  // Verify that header order matches the input JSON structure
})
plugins/json-multisheet-extractor/README.md (6)

1-35: Replace hard tabs with spaces in the JSON example.

The JSON example effectively illustrates the plugin's functionality. However, for better markdown consistency, replace hard tabs with spaces.

 {
-	"contacts": [ // sheet 1
-		{ // record 1
-			"firstName": "John", // header 1, value
+    "contacts": [ // sheet 1
+        { // record 1
+            "firstName": "John", // header 1, value
🧰 Tools
🪛 Markdownlint

Lines 8-10, 13-15, 18-22, 24-25 (column 1): Hard tabs (MD010, no-hard-tabs)


41-51: Enhance parameter documentation with specific recommendations.

Consider adding:

  1. Recommended ranges for chunkSize based on typical JSON file sizes
  2. Guidelines for selecting parallel values based on server resources
  3. Memory usage estimates for different parameter combinations

54-66: Add descriptions and sequence information for API calls.

The API calls list would be more helpful with:

  1. Brief descriptions of each API's purpose
  2. The sequence in which these APIs are called during the extraction process
  3. Example responses or potential error scenarios

89-89: Fix grammatical error in the description.

Add the missing preposition "in":

-When a JSON file is uploaded, the plugin will extract the structured data and process it the extractor's parser.
+When a JSON file is uploaded, the plugin will extract the structured data and process it in the extractor's parser.
🧰 Tools
🪛 LanguageTool

[uncategorized] ~89-~89: Possible missing preposition found.
Context: ...ract the structured data and process it the extractor's parser. ```javascript impo...

(AI_HYDRA_LEO_MISSING_IN)


91-101: Enhance the code example with error handling.

The example would be more complete with:

  1. Error handling for file processing failures
  2. Configuration examples for the optional parameters
  3. Comments explaining the expected outcomes

1-104: Overall documentation provides good coverage but could be enhanced.

The README effectively documents the new plugin's functionality and aligns well with the PR objectives. However, consider adding:

  1. A troubleshooting section for common issues
  2. Performance benchmarks or recommendations
  3. Limitations or constraints of the plugin
🧰 Tools
🪛 LanguageTool: ~89: possible missing preposition ("...process it the extractor's parser") (AI_HYDRA_LEO_MISSING_IN)
🪛 Markdownlint: lines 8-10, 13-15, 18-22, 24-25 (column 1): hard tabs (MD010, no-hard-tabs)

utils/extractor/src/index.spec.ts (2)

1-8: LGTM! Consider adding error handling for file reading.

The test setup is well-structured. However, it would be beneficial to handle potential file reading errors, especially since this is a critical setup step for the tests.

 const buffer: Buffer = fs.readFileSync(
   path.join(__dirname, '../ref/test-array-to-sheetcapture.json')
 )
+
+// Ensure the file exists and can be parsed
+expect(buffer).toBeTruthy()

13-97: Improve test data maintainability.

The test data is comprehensive but contains repetitive structures. Consider:

  1. Extracting common data into constants
  2. Using test data factories
  3. Adding specific assertions for different aspects (structure, types, undefined handling)

Example refactor:

const COMMON_FATHER_HIERARCHY = {
  'Father.First Name': { value: 'Father_First_1' },
  'Father.Last Name': { value: 'Father_Last_1' },
  'Father.Father.First Name': { value: 'Father_First_2' },
  'Father.Father.Last Name': { value: 'Father_Last_2' },
  'Father.Father.Father.First Name': { value: 'Father_First_3' },
  'Father.Father.Father.Last Name': { value: 'Father_Last_3' },
};

const COMMON_COORDINATES = {
  'Address.Coordinates.Latitude': { value: '40.7128° N' },
  'Address.Coordinates.Longitude': { value: '74.0060° W' },
};

// Then use spread operator in test data
{
  'First Name': { value: 'Tony' },
  ...COMMON_FATHER_HIERARCHY,
  ...COMMON_COORDINATES,
  // ... other unique fields
}
utils/extractor/src/index.ts (2)

332-334: Consider preserving array structure instead of stringifying.

Converting arrays to strings loses the data structure and makes it harder for users to work with array data.

Consider preserving arrays or providing configuration options:

filteredRow[header] = {
  value: cell,
  type: Array.isArray(cell) ? 'array' : typeof cell
};

297-347: Consider optimizing for large datasets and improving type safety.

While the function integrates well with the existing codebase, there are a few suggestions for improvement:

  1. For large datasets, consider implementing streaming or chunked processing to avoid memory issues
  2. Add input validation for the workbook structure

Consider these improvements:

  1. Add batch processing:
const BATCH_SIZE = 1000;

export async function* parseSheetInBatches(
  jsonArray: any[],
  batchSize: number = BATCH_SIZE
): AsyncGenerator<Partial<SheetCapture>> {
  // Process headers from first item
  const firstBatch = jsonArray.slice(0, 1);
  const { headers } = parseSheet(firstBatch);
  
  // Process data in batches
  for (let i = 0; i < jsonArray.length; i += batchSize) {
    const batch = jsonArray.slice(i, i + batchSize);
    const { data } = parseSheet(batch);
    yield { headers, data };
  }
}
  1. Add input validation:
function validateWorkbookStructure(data: any[]): void {
  if (!Array.isArray(data)) {
    throw new Error('Input must be an array');
  }
  
  if (data.some(item => typeof item !== 'object' || item === null)) {
    throw new Error('All items must be non-null objects');
  }
}
plugins/json-extractor/src/parser.ts (1)

31-39: Simplify the 'isEmpty' function using 'Object.keys'.

The custom isEmpty function can be simplified by utilizing Object.keys(obj).length === 0, which is a more idiomatic and concise way to check if an object is empty in JavaScript.

Consider updating the code as follows:

function isEmpty(obj) {
  return Object.keys(obj).length === 0;
}

Alternatively, you can remove the isEmpty function altogether and directly modify the condition at line 18:

- if (isEmpty(sheetCapture)) {
+ if (Object.keys(sheetCapture).length === 0) {

This approach reduces code complexity and leverages built-in JavaScript capabilities.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a85f63a and 8bd720a.

⛔ Files ignored due to path filters (7)
  • package-lock.json is excluded by !**/package-lock.json, !**/*.json
  • plugins/json-extractor/package.json is excluded by !**/*.json
  • plugins/json-extractor/ref/test-basic.json is excluded by !**/*.json
  • plugins/json-multisheet-extractor/package.json is excluded by !**/*.json
  • plugins/json-multisheet-extractor/ref/test-multisheet.json is excluded by !**/*.json
  • utils/extractor/package.json is excluded by !**/*.json
  • utils/extractor/ref/test-array-to-sheetcapture.json is excluded by !**/*.json
📒 Files selected for processing (12)
  • .changeset/slow-bottles-roll.md (1 hunks)
  • plugins/json-extractor/src/parser.spec.ts (0 hunks)
  • plugins/json-extractor/src/parser.ts (2 hunks)
  • plugins/json-multisheet-extractor/CHANGELOG.md (1 hunks)
  • plugins/json-multisheet-extractor/README.md (1 hunks)
  • plugins/json-multisheet-extractor/jest.config.cjs (1 hunks)
  • plugins/json-multisheet-extractor/src/index.ts (1 hunks)
  • plugins/json-multisheet-extractor/src/parser.spec.ts (1 hunks)
  • plugins/json-multisheet-extractor/src/parser.ts (1 hunks)
  • plugins/json-multisheet-extractor/tsup.config.mjs (1 hunks)
  • utils/extractor/src/index.spec.ts (1 hunks)
  • utils/extractor/src/index.ts (1 hunks)
💤 Files with no reviewable changes (1)
  • plugins/json-extractor/src/parser.spec.ts
✅ Files skipped from review due to trivial changes (4)
  • .changeset/slow-bottles-roll.md
  • plugins/json-multisheet-extractor/CHANGELOG.md
  • plugins/json-multisheet-extractor/jest.config.cjs
  • plugins/json-multisheet-extractor/tsup.config.mjs
🧰 Additional context used
🪛 LanguageTool: plugins/json-multisheet-extractor/README.md ~89: possible missing preposition ("...process it the extractor's parser") (AI_HYDRA_LEO_MISSING_IN)
🪛 Markdownlint: plugins/json-multisheet-extractor/README.md lines 8-10, 13-15, 18-22, 24-25 (column 1): hard tabs (MD010, no-hard-tabs)

🔇 Additional comments (4)
plugins/json-multisheet-extractor/src/index.ts (2)

1-2: LGTM! Clean and well-organized imports.

The imports demonstrate good separation of concerns by utilizing the shared extraction utility while keeping parsing logic local to the plugin.


4-10: Verify the usage of configuration parameters.

Let's verify how these configuration parameters affect the extraction process and ensure they're properly documented in tests.

✅ Verification successful

Let me gather more specific information about how these configuration parameters are used.


Let me check one more thing to verify the actual implementation of these parameters.


Let me check one final thing to understand how these parameters are used in the extractor utility.


Configuration parameters are properly implemented and documented

The configuration parameters are well-implemented and documented:

  • chunkSize (default: 5,000): Controls the number of records processed in each chunk
  • parallel (default: 1): Controls parallel processing of chunks
  • debug (default: false): Enables debug logging

The parameters are:

  • Properly documented in the README.md
  • Correctly implemented in the Extractor utility (utils/extractor/src/index.ts)
  • Used for performance optimization and debugging purposes
  • Have sensible default values

The implementation shows these parameters are effectively used during the extraction process, particularly during job execution and record processing. No issues were found with their implementation or documentation.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check test coverage for configuration parameters

# Test 1: Look for tests covering different configuration options
echo "Searching for test cases with different configurations..."
rg -l "JSONMultiSheetExtractor.*chunkSize|parallel|debug" --type ts

# Test 2: Check if there are examples or documentation
echo "Searching for documentation or examples..."
fd -e md -e txt . | xargs rg -l "JSONMultiSheetExtractor.*configuration|options"

Length of output: 2975


Script:

#!/bin/bash
# Check implementation details of the configuration parameters

# Look for the actual implementation of parseBuffer function
echo "Checking parseBuffer implementation..."
ast-grep --pattern 'function parseBuffer($$$)'

# Check the Extractor implementation to see how options are used
echo "Checking Extractor implementation..."
rg -A 10 "function Extractor.*options" utils/extractor/src/index.ts

# Check specific tests for json-multisheet-extractor
echo "Checking specific tests..."
fd ".*\.spec\.ts|.*\.test\.ts" plugins/json-multisheet-extractor/

# Check documentation content
echo "Checking documentation content..."
cat plugins/json-multisheet-extractor/README.md

Length of output: 3508


Script:

#!/bin/bash
# Check the parser implementation and tests

# Check parser implementation
echo "Checking parser implementation..."
cat plugins/json-multisheet-extractor/src/parser.spec.ts

# Check if parameters are used in the actual parser
echo "Checking parser usage of parameters..."
rg -A 5 "chunkSize|parallel|debug" plugins/json-multisheet-extractor/src/parser.ts

Length of output: 3180


Script:

#!/bin/bash
# Check the extractor utility implementation

echo "Checking extractor utility implementation..."
cat utils/extractor/src/index.ts

echo "Checking for any related test files..."
fd -e spec.ts -e test.ts . utils/extractor/

Length of output: 10147

plugins/json-multisheet-extractor/src/parser.spec.ts (1)

6-8: Verify the existence of test data file.

Let's ensure the reference test file exists in the correct location.

✅ Verification successful

Test data file exists and has the expected structure

The verification confirms that:

  • The test file exists at the correct path: plugins/json-multisheet-extractor/ref/test-multisheet.json
  • The JSON file contains the required contacts and orders properties
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence of test data file

# Test: Check if the test data file exists
fd -t f "test-multisheet.json" plugins/json-multisheet-extractor/ref/

# Test: Verify the JSON structure in the test file
if [ -f "plugins/json-multisheet-extractor/ref/test-multisheet.json" ]; then
  echo "Validating JSON structure..."
  cat plugins/json-multisheet-extractor/ref/test-multisheet.json | jq 'has("contacts") and has("orders")'
fi

Length of output: 378

utils/extractor/src/index.spec.ts (1)

6-8: Verify test file existence and contents.

Let's ensure the test file exists and contains the expected structure.

✅ Verification successful

Test file verification successful

The test file exists and contains valid JSON data with the expected array structure. The sample data shows a well-formed object with nested properties, suitable for testing array-to-sheet capture functionality.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if the test file exists and validate its structure
if [ -f "utils/extractor/ref/test-array-to-sheetcapture.json" ]; then
  echo "Test file exists"
  # Validate JSON structure
  cat "utils/extractor/ref/test-array-to-sheetcapture.json" | jq 'type'
  # Show sample of contents
  cat "utils/extractor/ref/test-array-to-sheetcapture.json" | jq '.[0]'
else
  echo "Test file not found!"
fi

Length of output: 893

Comment on lines +4 to +6
export function parseBuffer(buffer: Buffer): WorkbookCapture {
try {
const fileContents = buffer.toString('utf8')

🛠️ Refactor suggestion

Add input validation for buffer parameter.

The function should validate that the buffer parameter is not null or undefined.

 export function parseBuffer(buffer: Buffer): WorkbookCapture {
   try {
+    if (!Buffer.isBuffer(buffer)) {
+      throw new Error('Input must be a valid Buffer')
+    }
     const fileContents = buffer.toString('utf8')
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
export function parseBuffer(buffer: Buffer): WorkbookCapture {
try {
const fileContents = buffer.toString('utf8')
export function parseBuffer(buffer: Buffer): WorkbookCapture {
try {
if (!Buffer.isBuffer(buffer)) {
throw new Error('Input must be a valid Buffer')
}
const fileContents = buffer.toString('utf8')

Comment on lines +8 to +11
if (!fileContents) {
console.error('Invalid file contents')
return {} as WorkbookCapture
}

🛠️ Refactor suggestion

Improve empty content validation.

The current error handling could be more informative and type-safe.

 if (!fileContents) {
-      console.error('Invalid file contents')
-      return {} as WorkbookCapture
+      console.error('File contents are empty or invalid')
+      return Object.create(null) as WorkbookCapture
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (!fileContents) {
console.error('Invalid file contents')
return {} as WorkbookCapture
}
if (!fileContents) {
console.error('File contents are empty or invalid')
return Object.create(null) as WorkbookCapture
}

Comment on lines +26 to +32
try {
const result = parseSheet(value);
results[sheet] = result; // Store result in the results object
} catch (error) {
console.error(`Error processing sheet "${sheet}":`, error);
}
} else {

🛠️ Refactor suggestion

Improve error handling in sheet processing.

The current error handling swallows errors silently. Consider:

  1. Collecting errors for all sheets
  2. Providing more context about the failure
-          try {
-              const result = parseSheet(value);
-              results[sheet] = result; // Store result in the results object
-          } catch (error) {
-              console.error(`Error processing sheet "${sheet}":`, error);
-          }
+          try {
+              if (value.length === 0) {
+                  console.warn(`Sheet "${sheet}" is empty`);
+                  results[sheet] = { data: [], fields: [] };
+                  return;
+              }
+              const result = parseSheet(value);
+              results[sheet] = result;
+          } catch (error) {
+              const errorMessage = `Failed to process sheet "${sheet}": ${error.message}`;
+              console.error(errorMessage);
+              throw new Error(errorMessage);
+          }

Committable suggestion skipped: line range outside the PR's diff.

return {} as WorkbookCapture
}

const parsedData = JSON.parse(fileContents)

💡 Codebase verification

Add input validation and size limits to prevent JSON-based attacks

The current implementation lacks crucial security measures:

  • No file size limits before parsing
  • No JSON depth validation
  • Basic type checking but no structural validation
  • No limits on number of sheets or records

Specific locations requiring attention:

  • plugins/json-multisheet-extractor/src/parser.ts:13: Add size check before parsing
  • plugins/json-multisheet-extractor/src/parser.ts:22: Add depth validation and sheet count limits
  • plugins/json-multisheet-extractor/src/parser.ts:25: Add array size limits for each sheet
🔗 Analysis chain

Add safeguards against malicious JSON input.

Consider implementing:

  1. JSON size limits to prevent memory exhaustion
  2. Depth limits to prevent stack overflow attacks
  3. Validation of sheet names to prevent injection

Let's check if there are any size limits implemented elsewhere:


Let me gather more information about any existing validation or security measures in the JSON plugin:


Let me check one more thing to see if there are any size limits in the extractor utility or parent classes:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for size limit implementations in the codebase
rg -i "size.*limit|maxSize|max.*length" --type ts

Length of output: 887


Script:

#!/bin/bash
# Check for validation or security measures in the JSON plugin
rg -A 5 "parse|validate|check" plugins/json-multisheet-extractor/

# Look for any configuration or options that might include limits
ast-grep --pattern 'interface $_Config {
  $$$
}'  plugins/json-multisheet-extractor/

# Check if there's any error handling around the JSON parsing
rg -B 2 -A 2 "JSON.parse" plugins/json-multisheet-extractor/

Length of output: 7726


Script:

#!/bin/bash
# Check the extractor utility for any size limits
rg -B 2 -A 5 "Extractor|Buffer" "@flatfile/util-extractor"

# Also check the direct file contents
cat plugins/json-multisheet-extractor/src/parser.ts

Length of output: 1424
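A minimal sketch of the size and depth safeguards discussed above; the 10 MB and depth-32 limits are illustrative values chosen for the example, not settings from this codebase:

```typescript
// Sketch of pre-parse safeguards. MAX_BYTES and MAX_DEPTH are
// illustrative defaults, not values taken from the plugin.
const MAX_BYTES = 10 * 1024 * 1024 // 10 MB
const MAX_DEPTH = 32

// Depth of a parsed JSON value: scalars are 0, each object/array level adds 1.
function jsonDepth(value: unknown, depth = 0): number {
  if (value === null || typeof value !== 'object') return depth
  const children = Array.isArray(value) ? value : Object.values(value)
  let max = depth + 1
  for (const child of children) {
    max = Math.max(max, jsonDepth(child, depth + 1))
  }
  return max
}

function safeParse(buffer: Buffer): unknown {
  if (buffer.byteLength > MAX_BYTES) {
    throw new Error(`JSON input exceeds ${MAX_BYTES} bytes`)
  }
  const parsed = JSON.parse(buffer.toString('utf8'))
  if (jsonDepth(parsed) > MAX_DEPTH) {
    throw new Error(`JSON nesting exceeds depth ${MAX_DEPTH}`)
  }
  return parsed
}
```

Checking `byteLength` before calling `JSON.parse` avoids paying the parse cost for oversized payloads; the depth check necessarily runs after parsing, since `JSON.parse` itself has no depth option.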

Comment on lines +27 to +84
test('JSON to WorkbookCapture', () => {
  const capture = parseBuffer(buffer)
  expect(capture).toEqual({
    contacts: {
      headers: CONTACT_HEADERS,
      data: [
        {
          'First Name': { value: 'Tony' },
          'Last Name': { value: 'Lamb' },
          Email: { value: 'me@opbaj.tp' },
          'Address.Street': { value: '123 Main Street' },
          'Address.City': { value: 'Springfield' },
          'Address.State': { value: 'ST' },
          'Address.Zip': { value: '12345' },
          'Address.Coordinates.Latitude': { value: '40.7128° N' },
          'Address.Coordinates.Longitude': { value: '74.0060° W' },
        },
        {
          'First Name': { value: 'Christian' },
          'Last Name': { value: 'Ramos' },
          Email: { value: 'uw@ag.tg' },
          'Address.Street': { value: '456 Elm Street' },
          'Address.City': { value: 'Greenville' },
          'Address.State': { value: 'GT' },
          'Address.Zip': { value: '67890' },
          'Address.Coordinates.Latitude': { value: '40.7128° N' },
          'Address.Coordinates.Longitude': { value: '74.0060° W' },
        },
        {
          'First Name': { value: 'Frederick' },
          'Last Name': { value: 'Boyd' },
          Email: { value: 'kempur@ascebec.gs' },
          'Address.Street': { value: '789 Oak Street' },
          'Address.City': { value: 'Rivertown' },
          'Address.State': { value: 'RT' },
          'Address.Zip': { value: '10112' },
          'Address.Coordinates.Latitude': { value: '40.7128° N' },
          'Address.Coordinates.Longitude': { value: '74.0060° W' },
        },
      ],
      metadata: undefined,
    },
    orders: {
      headers: ORDER_HEADERS,
      data: [
        {
          ID: { value: '1234' },
          Amount: { value: 5678 },
        },
        {
          ID: { value: '9876' },
          Amount: { value: 5432 },
        },
      ],
      metadata: undefined,
    },
  })
})

🛠️ Refactor suggestion

Enhance test coverage with edge cases and type validation.

The current test case could be improved in several ways:

  1. Add test cases for empty arrays and null values
  2. Validate that numeric values (Amount) are properly typed
  3. Use more diverse test data (coordinates are identical for all contacts)

Also, consider splitting this large test into smaller, focused test cases:

test('handles empty arrays', () => {
  const emptyData = { contacts: [], orders: [] }
  const capture = parseBuffer(Buffer.from(JSON.stringify(emptyData)))
  expect(capture.contacts.data).toHaveLength(0)
  expect(capture.orders.data).toHaveLength(0)
})

test('preserves numeric types', () => {
  const capture = parseBuffer(buffer)
  expect(typeof capture.orders.data[0].Amount.value).toBe('number')
})

Comment on lines +11 to +13
test('JSON to SheetCapture', () => {
  const capture = parseSheet(jsonObj)
  expect(capture).toEqual({

🛠️ Refactor suggestion

Add test cases for edge cases and error scenarios.

While the happy path is covered, consider adding test cases for:

  • Empty arrays
  • Malformed JSON
  • Deeply nested objects beyond current depth
  • Arrays with mixed data types

Would you like me to help generate additional test cases to improve coverage?
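Two of those cases (malformed JSON and empty arrays) can be sketched without the plugin itself; plain `JSON.parse` stands in for the extractor's parsing step, so these are framework-agnostic assertions rather than the plugin's actual tests:

```typescript
// Sketch of edge-case checks, using JSON.parse as a stand-in
// for the extractor's parsing step.
function expectThrows(fn: () => unknown): boolean {
  try {
    fn()
    return false
  } catch {
    return true
  }
}

// Malformed JSON should surface a parse error, not a silent empty result.
const malformed = '{"contacts": [  '
const threwOnMalformed = expectThrows(() => JSON.parse(malformed))

// An empty top-level array is valid JSON and should yield zero records.
const empty = JSON.parse('{"contacts": []}') as { contacts: unknown[] }
```

In a jest suite these would become `expect(() => ...).toThrow()` and `expect(capture.contacts.data).toHaveLength(0)` respectively.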

Comment on lines +311 to +321
const flattenObject = (obj: any, parent: string = '', res: any = {}) => {
  for (let key in obj) {
    const propName = parent ? `${parent}.${key}` : key
    if (typeof obj[key] === 'object') {
      flattenObject(obj[key], propName, res)
    } else {
      res[propName] = obj[key]
    }
  }
  return res
}

🛠️ Refactor suggestion

Add protection against circular references and improve performance.

The flattenObject function has several potential issues:

  1. No protection against circular references
  2. Mutates the res parameter which could cause side effects
  3. Doesn't handle arrays properly in nested structures

Consider this improved implementation:

const flattenObject = (
  obj: any,
  parent: string = '',
  seen: WeakSet<object> = new WeakSet()
): Record<string, any> => {
  const res: Record<string, any> = {};
  
  if (typeof obj !== 'object' || obj === null || seen.has(obj)) {
    return res;
  }
  
  seen.add(obj);
  
  for (const [key, value] of Object.entries(obj)) {
    const propName = parent ? `${parent}.${key}` : key;
    
    if (Array.isArray(value)) {
      res[propName] = value;
    } else if (typeof value === 'object' && value !== null) {
      Object.assign(res, flattenObject(value, propName, seen));
    } else {
      res[propName] = value;
    }
  }
  
  return res;
};
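For reference, a standalone flattener with the suggested behavior (dot-delimited keys for nested objects, arrays preserved as-is) can be reduced to the following; `flatten` is a simplified illustration, not the plugin's export:

```typescript
// Simplified flattener: nested objects become dot-delimited keys,
// arrays are kept intact rather than recursed into.
function flatten(
  obj: Record<string, unknown>,
  parent = ''
): Record<string, unknown> {
  const res: Record<string, unknown> = {}
  for (const [key, value] of Object.entries(obj)) {
    const propName = parent ? `${parent}.${key}` : key
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      Object.assign(res, flatten(value as Record<string, unknown>, propName))
    } else {
      res[propName] = value
    }
  }
  return res
}

// flatten({ Address: { City: 'Springfield' }, Tags: ['a'] })
// yields { 'Address.City': 'Springfield', Tags: ['a'] }
```

Returning a fresh object on each call (rather than mutating a shared `res` parameter) is what removes the side-effect concern noted in point 2 above.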

Comment on lines +304 to +306
    if (filteredResults.length === 0) {
      return {} as SheetCapture
    }

⚠️ Potential issue

Return a properly structured empty SheetCapture.

The empty object return doesn't match the SheetCapture type structure. This could cause runtime errors.

-      return {} as SheetCapture
+      return {
+        headers: [],
+        data: [],
+      } as SheetCapture

Comment on lines +343 to +346
  } catch (error) {
    console.error('An error occurred:', error)
    throw error
  }

⚠️ Potential issue

Improve error handling to avoid leaking sensitive data.

The current error handling logs the entire error object, which might contain sensitive information.

  } catch (error) {
-    console.error('An error occurred:', error)
+    console.error('Error parsing sheet:', error instanceof Error ? error.message : 'Unknown error')
    throw new Error('Failed to parse sheet data')
  }

@amarbakir
Author

@carlbrugger I think we're ready for a review whenever you are! Let us know how you prefer to handle the AI review bot's suggestions. Some are already addressed here: aprimetechnology#1

Thanks!

@carlbrugger
Contributor

@amarbakir thanks for the PR! I will look it over.

@carlbrugger
Contributor

carlbrugger commented Nov 6, 2024

@amarbakir I really like the feature you're adding here. 🙌🏼 I integrated this functionality into the existing JSON extractor plugin in #698. Would you take a look at it?

@carlbrugger
Contributor

Also, are you on Flatfile Developers Slack?

@amarbakir
Author

Hey @carlbrugger! Just seeing your comments. I took a look and it's super straightforward. Overall I'm a fan of having this as an extra function of the existing extractor.

Working on Slack access now! I should be able to get that through APrime soon

@carlbrugger
Contributor

Awesome! We'll get #698 released in short order so you can start using it. Thanks for your contribution!

@amarbakir
Author

Functionality added through #698

@amarbakir amarbakir closed this Nov 8, 2024


Development

Successfully merging this pull request may close these issues.

JSON extractor fails to extract any files on 0.9.0

2 participants