extractor: Markdown Extractor #643
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,46 +1,6 @@ | ||
| import type { FlatfileListener } from '@flatfile/listener' | ||
| import { summarize } from '@flatfile/plugin-enrich-summarize' | ||
| import { configureSpace } from '@flatfile/plugin-space-configure' | ||
| import { MarkdownExtractor } from '@flatfile/plugin-markdown-extractor' | ||
|
|
||
| export default async function (listener: FlatfileListener) { | ||
| listener.use( | ||
| summarize({ | ||
| sheetSlug: 'summarization', | ||
| contentField: 'content', | ||
| summaryField: 'summary', | ||
| keyPhrasesField: 'keyPhrases', | ||
| }) | ||
| ) | ||
| listener.use( | ||
| configureSpace({ | ||
| workbooks: [ | ||
| { | ||
| name: 'Sandbox', | ||
| sheets: [ | ||
| { | ||
| name: 'Summarization', | ||
| slug: 'summarization', | ||
| fields: [ | ||
| { | ||
| key: 'content', | ||
| type: 'string', | ||
| label: 'Content', | ||
| }, | ||
| { | ||
| key: 'summary', | ||
| type: 'string', | ||
| label: 'Summary', | ||
| }, | ||
| { | ||
| key: 'keyPhrases', | ||
| type: 'string', | ||
| label: 'Key Phrases', | ||
| }, | ||
| ], | ||
| }, | ||
| ], | ||
| }, | ||
| ], | ||
| }) | ||
| ) | ||
| listener.use(MarkdownExtractor()) | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,69 @@ | ||
| <!-- START_INFOCARD --> | ||
|
|
||
| The `@flatfile/plugin-markdown-extractor` plugin parses Markdown files and extracts tabular data, creating sheets in Flatfile for each table found. | ||
|
|
||
| **Event Type:** | ||
| `listener.on('file:created')` | ||
|
|
||
| **Supported file types:** | ||
| `.md` | ||
|
|
||
| <!-- END_INFOCARD --> | ||
|
|
||
| > When embedding Flatfile, this plugin should be deployed in a server-side listener. [Learn more](/docs/orchestration/listeners#listener-types) | ||
|
|
||
| ## Parameters | ||
|
|
||
|
|
||
|
|
||
| #### `options.maxTables` - `default: Infinity` - `number` - (optional) | ||
| The `maxTables` parameter allows you to limit the number of tables extracted from a single Markdown file. | ||
|
|
||
| #### `options.errorHandling` - `default: "lenient"` - `"strict" | "lenient"` - (optional) | ||
| The `errorHandling` parameter determines how the plugin handles parsing errors. In `strict` mode it throws an error; in `lenient` mode it logs a warning and skips the problematic table. | ||
|
|
||
| #### `options.debug` - `default: false` - `boolean` - (optional) | ||
| The `debug` parameter enables additional logging for troubleshooting. | ||
|
||
|
|
||
| ## Usage | ||
|
|
||
| Register the extractor on your listener. When a Markdown file is uploaded to Flatfile, the plugin extracts its tables automatically. Once extraction completes, the file is ready for import in the Files area. | ||
|
|
||
| ```bash Install | ||
| npm i @flatfile/plugin-markdown-extractor | ||
| ``` | ||
|
|
||
| ```js import | ||
| import { MarkdownExtractor } from "@flatfile/plugin-markdown-extractor"; | ||
| ``` | ||
|
|
||
| ```js listener.js | ||
| listener.use(MarkdownExtractor()); | ||
| ``` | ||
|
||
|
|
||
| ### Full Example | ||
|
|
||
| In this example, the `MarkdownExtractor` is initialized with custom options, and then registered as middleware with the Flatfile listener. When a Markdown file is uploaded, the plugin will extract the tabular data and process it using the extractor's parser. | ||
|
|
||
| ```javascript | ||
| import { MarkdownExtractor } from "@flatfile/plugin-markdown-extractor"; | ||
|
|
||
| export default async function (listener) { | ||
| // Define the extractor options (all are optional) | ||
| const options = { | ||
| maxTables: 5, | ||
| errorHandling: 'strict', | ||
| debug: true | ||
| }; | ||
|
|
||
| // Initialize the Markdown extractor | ||
| const markdownExtractor = MarkdownExtractor(options); | ||
|
|
||
| // Register the extractor as a middleware for the Flatfile listener | ||
| listener.use(markdownExtractor); | ||
|
|
||
| // When a Markdown file is uploaded, the tabular data will be extracted and processed using the extractor's parser. | ||
| } | ||
| ``` | ||
|
|
||
| This plugin will create a new sheet for each table found in the Markdown file, with the table headers becoming field names and the rows becoming records. | ||
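
To make the mapping described at the end of the README concrete, here is a minimal TypeScript sketch of how one Markdown table could be turned into field names and records. The helpers `splitRow` and `tableToRecords` are hypothetical names introduced for illustration; they are not the plugin's `parseBuffer` implementation, which is not shown here, and the real parser may treat escaping, alignment rows, or empty cells differently.

```typescript
// Hypothetical sketch: NOT the plugin's actual parser.
type RecordData = Record<string, string>

// Split one table line into trimmed cell values.
function splitRow(line: string): string[] {
  return line
    .trim()
    .replace(/^\|/, '') // drop the leading pipe
    .replace(/\|$/, '') // drop the trailing pipe
    .split('|')
    .map((cell) => cell.trim())
}

// Convert a single Markdown table into headers (field names) and records.
function tableToRecords(markdown: string): {
  headers: string[]
  records: RecordData[]
} {
  const lines = markdown
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.startsWith('|'))

  // A table needs at least a header line and a separator line.
  if (lines.length < 2) {
    return { headers: [], records: [] }
  }

  const [headerLine, , ...rowLines] = lines // second line is the separator
  const headers = splitRow(headerLine)

  const records = rowLines.map((line) => {
    const cells = splitRow(line)
    const record: RecordData = {}
    headers.forEach((header, i) => {
      record[header] = cells[i] ?? '' // assumption: missing cells become ''
    })
    return record
  })

  return { headers, records }
}

const sampleTable = [
  '| Name | Age | City |',
  '|------|-----|------|',
  '| John | 30 | New York |',
  '| Alice | 25 | London |',
].join('\n')

const parsed = tableToRecords(sampleTable)
console.log(parsed.headers)
console.log(parsed.records)
```

Every cell stays a string in this sketch; whether the plugin coerces numbers, dates, or booleans is not stated in the README.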
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| module.exports = { | ||
| testEnvironment: 'node', | ||
|
|
||
| transform: { | ||
| '^.+\\.tsx?$': 'ts-jest', | ||
| }, | ||
| setupFiles: ['../../test/dotenv-config.js'], | ||
|
||
| setupFilesAfterEnv: [ | ||
| '../../test/betterConsoleLog.js', | ||
| '../../test/unit.cleanup.js', | ||
| ], | ||
| testTimeout: 60_000, | ||
| globalSetup: '../../test/setup-global.js', | ||
| forceExit: true, | ||
| passWithNoTests: true, | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,62 @@ | ||
| { | ||
| "name": "@flatfile/plugin-markdown-extractor", | ||
| "version": "0.0.1", | ||
| "url": "https://github.com/FlatFilers/flatfile-plugins/tree/main/plugins/markdown-extractor", | ||
| "description": "A plugin for parsing markdown files in Flatfile.", | ||
| "registryMetadata": { | ||
| "category": "extractors" | ||
| }, | ||
| "engines": { | ||
| "node": ">= 16" | ||
| }, | ||
| "type": "module", | ||
| "browser": { | ||
| "./dist/index.cjs": "./dist/index.browser.cjs", | ||
| "./dist/index.mjs": "./dist/index.browser.mjs" | ||
| }, | ||
| "exports": { | ||
| "types": "./dist/index.d.ts", | ||
| "node": { | ||
| "import": "./dist/index.mjs", | ||
| "require": "./dist/index.cjs" | ||
| }, | ||
| "browser": { | ||
| "require": "./dist/index.browser.cjs", | ||
| "import": "./dist/index.browser.mjs" | ||
| }, | ||
| "default": "./dist/index.mjs" | ||
| }, | ||
| "main": "./dist/index.cjs", | ||
| "module": "./dist/index.mjs", | ||
| "source": "./src/index.ts", | ||
| "types": "./dist/index.d.ts", | ||
| "files": [ | ||
| "dist/**" | ||
| ], | ||
| "scripts": { | ||
| "build": "rollup -c", | ||
| "build:watch": "rollup -c --watch", | ||
| "build:prod": "NODE_ENV=production rollup -c", | ||
| "check": "tsc ./**/*.ts --noEmit --esModuleInterop", | ||
| "test": "jest src/*.spec.ts --detectOpenHandles", | ||
| "test:unit": "jest src/*.spec.ts --testPathIgnorePatterns=.*\\.e2e\\.spec\\.ts$ --detectOpenHandles", | ||
| "test:e2e": "jest src/*.e2e.spec.ts --detectOpenHandles" | ||
| }, | ||
| "keywords": [ | ||
| "flatfile-plugins", | ||
| "category-extractors" | ||
| ], | ||
| "author": "FlatFilers", | ||
| "repository": { | ||
| "type": "git", | ||
| "url": "https://github.com/FlatFilers/flatfile-plugins.git", | ||
| "directory": "plugins/markdown-extractor" | ||
| }, | ||
| "license": "ISC", | ||
| "dependencies": { | ||
| "@flatfile/util-extractor": "^2.1.2" | ||
| }, | ||
| "devDependencies": { | ||
| "@flatfile/rollup-config": "0.1.1" | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| import { buildConfig } from '@flatfile/rollup-config' | ||
|
|
||
| const config = buildConfig({}) | ||
|
|
||
| export default config |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| # Complex Table Example | ||
|
|
||
| This Markdown file contains a more complex table with various data types and potential parsing challenges. | ||
|
|
||
| | Product | Price | Stock | Last Updated | Features | On Sale | | ||
| |---------|-------|-------|--------------|----------|--------| | ||
| | Laptop | $999.99 | 50 | 2023-05-01 | 15" screen, 16GB RAM | true | | ||
| | Smartphone | $599.99 | 100 | 2023-05-02 | 6.5" display, 128GB storage | false | | ||
| | Tablet | $399.99 | 75 | 2023-05-03 | 10" screen, 64GB storage | true | | ||
| | Headphones | $149.99 | 200 | 2023-05-04 | Noise-cancelling, Bluetooth 5.0 | false | | ||
| | Smart Watch | $249.99 | 30 | 2023-05-05 | Heart rate monitor, GPS | true | | ||
| | External SSD | $89.99 | 150 | 2023-05-06 | 1TB, USB 3.1 | false | | ||
|
|
||
| This table includes: | ||
| - Currency values | ||
| - Integers | ||
| - Dates | ||
| - Booleans | ||
| - Strings with commas | ||
|
|
||
| It should test the parser's ability to handle various data types and potential edge cases. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,28 @@ | ||
| # Lenient Tables Example | ||
|
|
||
| This Markdown file contains multiple tables with mismatched column counts. | ||
|
|
||
| ## Table 1: Employees | ||
|
|
||
| | ID | Name | Department | | ||
| |----|------|------------| | ||
| | 1 | John Doe | HR | | ||
| | 2 | Jane Smith | | ||
| | 3 | Mike Johnson | Finance | | ||
|
|
||
| ## Table 2: Projects | ||
|
|
||
| | Project Name | Start Date | End Date | | ||
| |--------------|------------|----------| | ||
| | Website Redesign | 2023-01-01 | 2023-06-30 | | ||
| | Mobile App | 2023-03-15 | 2023-12-31 | extra column | | ||
|
|
||
| ## Table 3: Budget | ||
|
|
||
| | Category | Q1 | Q2 | Q3 | Q4 | | ||
| |----------|----|----|----|----|----| | ||
| | Marketing | $10,000 | $15,000 | $20,000 | $25,000 | | ||
| | R&D | $50,000 | $60,000 | $70,000 | $80,000 | extra column | | ||
| | Operations | $100,000 | $110,000 | $120,000 | $130,000 | | ||
|
|
||
| End of the file. |
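
The fixture above exercises the `errorHandling` option. As a rough sketch of the behavior the README describes (`strict` throws, `lenient` logs a warning and skips the table), the check could look like the following. The `ParsedTable` shape and the `checkTable` function are hypothetical, and treating a mismatched cell count as the parsing error is an assumption; the plugin's actual rules are not shown here.

```typescript
// Hypothetical sketch: NOT the plugin's actual error handling.
type ErrorHandling = 'strict' | 'lenient'

interface ParsedTable {
  headers: string[]
  rows: string[][]
}

// Return the table if every row matches the header width.
// Otherwise throw (strict) or warn and return null to skip it (lenient).
function checkTable(
  table: ParsedTable,
  mode: ErrorHandling
): ParsedTable | null {
  const badIndex = table.rows.findIndex(
    (row) => row.length !== table.headers.length
  )
  if (badIndex === -1) {
    return table
  }

  const message =
    `Row ${badIndex + 1} has ${table.rows[badIndex].length} cells, ` +
    `expected ${table.headers.length}`

  if (mode === 'strict') {
    throw new Error(message)
  }
  console.warn(`Skipping table: ${message}`)
  return null
}

// The Employees table from the fixture: the second row is missing a cell.
const employees: ParsedTable = {
  headers: ['ID', 'Name', 'Department'],
  rows: [
    ['1', 'John Doe', 'HR'],
    ['2', 'Jane Smith'],
    ['3', 'Mike Johnson', 'Finance'],
  ],
}

const lenientResult = checkTable(employees, 'lenient')
console.log(lenientResult) // the table is skipped in lenient mode
```

Calling `checkTable(employees, 'strict')` on the same input throws instead of returning.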
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,28 @@ | ||
| # Multiple Tables Example | ||
|
|
||
| This Markdown file contains multiple tables. | ||
|
|
||
| ## Table 1: Employees | ||
|
|
||
| | ID | Name | Department | | ||
| |----|------|------------| | ||
| | 1 | John Doe | HR | | ||
| | 2 | Jane Smith | IT | | ||
| | 3 | Mike Johnson | Finance | | ||
|
|
||
| ## Table 2: Projects | ||
|
|
||
| | Project Name | Start Date | End Date | | ||
| |--------------|------------|----------| | ||
| | Website Redesign | 2023-01-01 | 2023-06-30 | | ||
| | Mobile App | 2023-03-15 | 2023-12-31 | | ||
|
|
||
| ## Table 3: Budget | ||
|
|
||
| | Category | Q1 | Q2 | Q3 | Q4 | | ||
| |----------|----|----|----|----| | ||
| | Marketing | $10,000 | $15,000 | $20,000 | $25,000 | | ||
| | R&D | $50,000 | $60,000 | $70,000 | $80,000 | | ||
| | Operations | $100,000 | $110,000 | $120,000 | $130,000 | | ||
|
|
||
| End of the file. |
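
A file like the one above is where the `maxTables` option matters. The following sketch shows one way to locate tables in a Markdown document and cap the result. The `findTables` function is a hypothetical helper, and grouping consecutive lines that start with `|` is a simplifying assumption rather than the plugin's actual detection logic.

```typescript
// Hypothetical sketch: NOT the plugin's actual table detection.
// Groups consecutive pipe-prefixed lines into tables, then applies a cap.
function findTables(markdown: string, maxTables = Infinity): string[][] {
  const tables: string[][] = []
  let current: string[] = []

  for (const line of markdown.split('\n')) {
    const trimmed = line.trim()
    if (trimmed.startsWith('|')) {
      current.push(trimmed)
      continue
    }
    // A non-table line closes the table collected so far.
    if (current.length > 0) {
      tables.push(current)
      current = []
    }
  }
  if (current.length > 0) {
    tables.push(current)
  }

  // Keep only the first `maxTables` tables, in document order.
  return tables.slice(0, maxTables)
}

const multiTableDoc = [
  '## Table 1: Employees',
  '',
  '| ID | Name | Department |',
  '|----|------|------------|',
  '| 1 | John Doe | HR |',
  '',
  '## Table 2: Projects',
  '',
  '| Project Name | Start Date | End Date |',
  '|--------------|------------|----------|',
  '| Mobile App | 2023-03-15 | 2023-12-31 |',
  '',
  '## Table 3: Budget',
  '',
  '| Category | Q1 | Q2 | Q3 | Q4 |',
  '|----------|----|----|----|----|',
  '| Marketing | $10,000 | $15,000 | $20,000 | $25,000 |',
].join('\n')

const allTables = findTables(multiTableDoc)
const cappedTables = findTables(multiTableDoc, 2)
console.log(allTables.length, cappedTables.length)
```

Which tables the plugin keeps when the limit is reached is not documented in this diff; the sketch assumes the first ones in document order.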
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| # Simple Table Example | ||
|
|
||
| This is a simple Markdown file with a single table. | ||
|
|
||
| | Name | Age | City | | ||
| |------|-----|------| | ||
| | John | 30 | New York | | ||
| | Alice | 25 | London | | ||
| | Bob | 35 | Paris | | ||
|
|
||
| End of the file. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| import { Extractor } from '@flatfile/util-extractor' | ||
| import { parseBuffer } from './parser' | ||
|
|
||
| export interface MarkdownExtractorOptions { | ||
| maxTables?: number | ||
| errorHandling?: 'strict' | 'lenient' | ||
| debug?: boolean | ||
| } | ||
|
|
||
| export const MarkdownExtractor = (options: MarkdownExtractorOptions = {}) => { | ||
| return Extractor('.md', 'markdown', parseBuffer, options) | ||
| } | ||
|
|
||
| export const markdownParser = parseBuffer |