validator: text summarizer #629
Merged
5 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,67 @@ | ||
| <!-- START_INFOCARD --> | ||
|
|
||
| # @flatfile/plugin-enrich-summarize | ||
|
|
||
| This plugin provides automatic text summarization capabilities for Flatfile using natural language processing. It uses the compromise library to generate summaries and extract key phrases from specified fields. | ||
|
|
||
| **Event Type:** `commit:created` | ||
|
|
||
| **Supported Field Types:** `string` | ||
|
|
||
| <!-- END_INFOCARD --> | ||
|
|
||
| ## Features | ||
|
|
||
| - Automatic text summarization | ||
| - Key phrase extraction | ||
| - Configurable summary length or percentage | ||
| - Custom field mapping for content, summary, and key phrases | ||
| - Error handling for missing content | ||
|
|
||
| ## Parameters | ||
|
|
||
| #### `sheetSlug` - `string` - (required) | ||
| The slug of the sheet to apply summarization. | ||
|
|
||
| #### `contentField` - `string` - (required) | ||
| The field containing the full text content. | ||
|
|
||
| #### `summaryField` - `string` - (required) | ||
| The field to store the generated summary. | ||
|
|
||
| #### `keyPhrasesField` - `string` - (required) | ||
| The field to store extracted key phrases. | ||
|
|
||
| #### `summaryLength` - `number` - (optional) | ||
| Number of sentences in the summary. Default is 2. | ||
|
|
||
| #### `summaryPercentage` - `number` - (optional) | ||
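| Percentage of the content's sentences to include in the summary (for example, `30` for 30%). When set, it takes precedence over `summaryLength`. | ||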
|
|
||
| ## Usage | ||
|
|
||
| **install** | ||
| ```bash | ||
| npm install @flatfile/plugin-enrich-summarize | ||
| ``` | ||
|
|
||
| **import** | ||
| ```javascript | ||
| import { FlatfileListener } from "@flatfile/listener"; | ||
| import { summarize } from "@flatfile/plugin-enrich-summarize"; | ||
| ``` | ||
|
|
||
| **listener.js** | ||
| ```javascript | ||
| const listener = new FlatfileListener(); | ||
|
|
||
| listener.use( | ||
| summarize({ | ||
| sheetSlug: "articles", | ||
| contentField: "full_text", | ||
| summaryField: "summary", | ||
| keyPhrasesField: "key_phrases", | ||
| summaryLength: 3 | ||
| }) | ||
| ); | ||
| ``` |
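
The README lists both `summaryLength` and `summaryPercentage`, so it may help to see how the two resolve to a sentence count. The sketch below mirrors the arithmetic used in this PR's `generateSummary()`; the helper name `resolveSentenceCount` is hypothetical and is not exported by the plugin.

```typescript
// Hypothetical helper (not part of the plugin) that mirrors how the
// summary options in this PR resolve to a number of sentences.
interface SummaryOptions {
  summaryLength?: number
  summaryPercentage?: number
}

function resolveSentenceCount(
  totalSentences: number,
  options: SummaryOptions = {}
): number {
  // Fall back to 2 sentences when no explicit length is given.
  let count = options.summaryLength || 2
  // A percentage, when present, replaces the fixed length and never drops below 1.
  if (options.summaryPercentage) {
    count = Math.max(
      1,
      Math.floor((totalSentences * options.summaryPercentage) / 100)
    )
  }
  return count
}
```

Under these assumptions, passing both options means the percentage wins, and a very small percentage on a short text still yields one sentence.
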
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| module.exports = { | ||
| testEnvironment: 'node', | ||
|
|
||
| transform: { | ||
| '^.+\\.tsx?$': 'ts-jest', | ||
| }, | ||
| setupFiles: ['../../test/dotenv-config.js'], | ||
| setupFilesAfterEnv: [ | ||
| '../../test/betterConsoleLog.js', | ||
| '../../test/unit.cleanup.js', | ||
| ], | ||
| testTimeout: 60_000, | ||
| globalSetup: '../../test/setup-global.js', | ||
| forceExit: true, | ||
| passWithNoTests: true, | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,84 @@ | ||
| { | ||
| "timestamp": "2024-09-24T07-23-50-639Z", | ||
| "task": "Develop a text summarizer Flatfile Listener plugin:\n - Create a RecordHook to summarize text fields\n - Use the 'compromise' npm package for natural language processing\n - Implement extractive summarization techniques\n - Allow configuration of summary length or percentage\n - Add summarized text to a new field in the record\n - Give the user reasonable config options to specify the Sheet Slug, the Field(s) that are the text(s), whether the summarization should be done automatically", | ||
| "summary": "This code implements a Flatfile Listener plugin for text summarization. It uses the compromise library for natural language processing and provides configurable options for automatic summarization of content in specified fields.", | ||
| "steps": [ | ||
| [ | ||
| "Retrieve information about Flatfile Listeners and RecordHook.\n", | ||
| "#E1", | ||
| "PineconeAssistant", | ||
| "Provide information about Flatfile Listeners and RecordHook, including their structure and usage", | ||
| "Plan: Retrieve information about Flatfile Listeners and RecordHook.\n#E1 = PineconeAssistant[Provide information about Flatfile Listeners and RecordHook, including their structure and usage]" | ||
| ], | ||
| [ | ||
| "Search for information about the 'compromise' npm package for natural language processing.\n", | ||
| "#E2", | ||
| "Google", | ||
| "compromise npm package natural language processing", | ||
| "Plan: Search for information about the 'compromise' npm package for natural language processing.\n#E2 = Google[compromise npm package natural language processing]" | ||
| ], | ||
| [ | ||
| "Create the basic structure of the Flatfile Listener plugin with RecordHook.\n", | ||
| "#E3", | ||
| "LLM", | ||
| "Create a basic structure for a Flatfile Listener plugin with RecordHook, using the information from #E1", | ||
| "Plan: Create the basic structure of the Flatfile Listener plugin with RecordHook.\n#E3 = LLM[Create a basic structure for a Flatfile Listener plugin with RecordHook, using the information from #E1]" | ||
| ], | ||
| [ | ||
| "Implement the text summarization logic using the 'compromise' package.\n", | ||
| "#E4", | ||
| "LLM", | ||
| "Implement extractive summarization techniques using the 'compromise' package based on the information from #E2 and integrate it into the Listener structure from #E3", | ||
| "Plan: Implement the text summarization logic using the 'compromise' package.\n#E4 = LLM[Implement extractive summarization techniques using the 'compromise' package based on the information from #E2 and integrate it into the Listener structure from #E3]" | ||
| ], | ||
| [ | ||
| "Add configuration options for summary length or percentage and field selection.\n", | ||
| "#E5", | ||
| "LLM", | ||
| "Extend the Listener code from #E4 to include configuration options for summary length or percentage, Sheet Slug, and field selection", | ||
| "Plan: Add configuration options for summary length or percentage and field selection.\n#E5 = LLM[Extend the Listener code from #E4 to include configuration options for summary length or percentage, Sheet Slug, and field selection]" | ||
| ], | ||
| [ | ||
| "Implement the logic to add the summarized text to a new field in the record.\n", | ||
| "#E6", | ||
| "LLM", | ||
| "Add code to the Listener from #E5 to create a new field in the record and populate it with the summarized text", | ||
| "Plan: Implement the logic to add the summarized text to a new field in the record.\n#E6 = LLM[Add code to the Listener from #E5 to create a new field in the record and populate it with the summarized text]" | ||
| ], | ||
| [ | ||
| "Add configuration option for automatic summarization.\n", | ||
| "#E7", | ||
| "LLM", | ||
| "Extend the Listener code from #E6 to include a configuration option for automatic summarization", | ||
| "Plan: Add configuration option for automatic summarization.\n#E7 = LLM[Extend the Listener code from #E6 to include a configuration option for automatic summarization]" | ||
| ], | ||
| [ | ||
| "Verify the Event Topics used in the Listener.\n", | ||
| "#E8", | ||
| "PineconeAssistant", | ||
| "Verify that the Event Topics used in the Listener are valid according to the event.topics.fact file", | ||
| "Plan: Verify the Event Topics used in the Listener.\n#E8 = PineconeAssistant[Verify that the Event Topics used in the Listener are valid according to the event.topics.fact file]" | ||
| ], | ||
| [ | ||
| "Finalize the Flatfile Listener plugin code.\n", | ||
| "#E9", | ||
| "LLM", | ||
| "Combine all the components from #E3, #E4, #E5, #E6, #E7, and #E8 into a complete Flatfile Listener plugin code, ensuring all imports are used and the code is valid", | ||
| "Plan: Finalize the Flatfile Listener plugin code.\n#E9 = LLM[Combine all the components from #E3, #E4, #E5, #E6, #E7, and #E8 into a complete Flatfile Listener plugin code, ensuring all imports are used and the code is valid]" | ||
| ], | ||
| [ | ||
| "Review and optimize the final code.\n", | ||
| "#E10", | ||
| "LLM", | ||
| "Review the code from #E9, remove any unused imports, validate plugin parameters, and ensure the code follows best practices for Flatfile Listener plugins", | ||
| "Plan: Review and optimize the final code.\n#E10 = LLM[Review the code from #E9, remove any unused imports, validate plugin parameters, and ensure the code follows best practices for Flatfile Listener plugins]" | ||
| ] | ||
| ], | ||
| "metrics": { | ||
| "tokens": { | ||
| "plan": 5077, | ||
| "state": 4675, | ||
| "total": 9752 | ||
| } | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,65 @@ | ||
| { | ||
| "name": "@flatfile/plugin-enrich-summarize", | ||
| "version": "0.0.0", | ||
| "description": "A Flatfile plugin for text summarization and key phrase extraction", | ||
| "registryMetadata": { | ||
| "category": "records" | ||
| }, | ||
| "engines": { | ||
| "node": ">= 16" | ||
| }, | ||
| "browser": { | ||
| "./dist/index.cjs": "./dist/index.browser.cjs", | ||
| "./dist/index.mjs": "./dist/index.browser.mjs" | ||
| }, | ||
| "exports": { | ||
| "types": "./dist/index.d.ts", | ||
| "node": { | ||
| "import": "./dist/index.mjs", | ||
| "require": "./dist/index.cjs" | ||
| }, | ||
| "browser": { | ||
| "require": "./dist/index.browser.cjs", | ||
| "import": "./dist/index.browser.mjs" | ||
| }, | ||
| "default": "./dist/index.mjs" | ||
| }, | ||
| "main": "./dist/index.cjs", | ||
| "module": "./dist/index.mjs", | ||
| "types": "./dist/index.d.ts", | ||
| "source": "./src/index.ts", | ||
| "files": [ | ||
| "dist/**" | ||
| ], | ||
| "scripts": { | ||
| "build": "rollup -c", | ||
| "build:watch": "rollup -c --watch", | ||
| "build:prod": "NODE_ENV=production rollup -c", | ||
| "check": "tsc ./**/*.ts --noEmit --esModuleInterop", | ||
| "test": "jest src/*.spec.ts --detectOpenHandles", | ||
| "test:unit": "jest src/*.spec.ts --testPathIgnorePatterns=.*\\.e2e\\.spec\\.ts$ --detectOpenHandles", | ||
| "test:e2e": "jest src/*.e2e.spec.ts --detectOpenHandles" | ||
| }, | ||
| "keywords": [ | ||
| "flatfile-plugins", | ||
| "category-enrich" | ||
| ], | ||
| "author": "Flatfile", | ||
| "repository": { | ||
| "type": "git", | ||
| "url": "https://github.com/FlatFilers/flatfile-plugins.git", | ||
| "directory": "enrich/summarize" | ||
| }, | ||
| "license": "ISC", | ||
| "dependencies": { | ||
| "@flatfile/plugin-record-hook": "^1.7.0", | ||
| "compromise": "^14.14.0" | ||
| }, | ||
| "peerDependencies": { | ||
| "@flatfile/api": "^1.4.13", | ||
| "@flatfile/listener": "^1.1.0" | ||
| }, | ||
| "devDependencies": { | ||
| "@flatfile/rollup-config": "^0.1.1" | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| import { buildConfig } from '@flatfile/rollup-config' | ||
|
|
||
| const config = buildConfig({}) | ||
|
|
||
| export default config |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| export { summarize } from './summarize.plugin' |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| import { extractKeyPhrases, generateSummary } from './summary.util' | ||
|
|
||
| describe('Summary Utility Functions', () => { | ||
| describe('generateSummary()', () => { | ||
| it('should generate a summary with default length', () => { | ||
| const content = | ||
| 'This is a test sentence. This is another test sentence. And a third one for good measure.' | ||
| const summary = generateSummary(content) | ||
| expect(summary).toBe( | ||
| 'This is a test sentence. ... And a third one for good measure.' | ||
| ) | ||
| }) | ||
|
|
||
| it('should generate a summary with specified length', () => { | ||
| const content = | ||
| 'First sentence. Second sentence. Third sentence. Fourth sentence. Fifth sentence.' | ||
| const summary = generateSummary(content, { summaryLength: 3 }) | ||
| expect(summary).toBe('First sentence. ... Fifth sentence.') | ||
| }) | ||
|
|
||
| it('should generate a summary with specified percentage', () => { | ||
| const content = | ||
| 'One. Two. Three. Four. Five. Six. Seven. Eight. Nine. Ten.' | ||
| const summary = generateSummary(content, { summaryPercentage: 30 }) | ||
| expect(summary).toBe('One. ... Ten.') | ||
| }) | ||
|
|
||
| it('should handle content shorter than summary length', () => { | ||
| const content = 'Short content.' | ||
| const summary = generateSummary(content, { summaryLength: 5 }) | ||
| expect(summary).toBe(content) | ||
| }) | ||
| }) | ||
|
|
||
| describe('extractKeyPhrases()', () => { | ||
| it('should extract key phrases from content', () => { | ||
| const content = 'The quick brown fox jumps over the lazy dog.' | ||
| const keyPhrases = extractKeyPhrases(content) | ||
| console.log('keyPhrases', keyPhrases) | ||
| expect(keyPhrases[0]).toContain('quick brown fox') | ||
| expect(keyPhrases[1]).toContain('lazy dog') | ||
| }) | ||
|
|
||
| it('should handle content with no key phrases', () => { | ||
| const content = '0 1 2 3 4 5 6 7 8 9' | ||
| const keyPhrases = extractKeyPhrases(content) | ||
| expect(keyPhrases).toHaveLength(0) | ||
| }) | ||
| }) | ||
| }) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| import { type FlatfileRecord, recordHook } from '@flatfile/plugin-record-hook' | ||
| import { generateSummary, extractKeyPhrases } from './summary.util' | ||
|
|
||
| interface SummarizationConfig { | ||
| sheetSlug: string | ||
| contentField: string | ||
| summaryField: string | ||
| keyPhrasesField: string | ||
| summaryLength?: number | ||
| summaryPercentage?: number | ||
| } | ||
|
|
||
| export function summarize(config: SummarizationConfig) { | ||
| return recordHook(config.sheetSlug, (record: FlatfileRecord) => { | ||
| const content = record.get(config.contentField) as string | ||
| const existingSummary = record.get(config.summaryField) as string | ||
|
||
|
|
||
| if (!content) { | ||
| record.addError( | ||
| config.contentField, | ||
| 'Content is required for summarization' | ||
| ) | ||
| return record | ||
| } | ||
|
|
||
| if (!existingSummary) { | ||
| const summary = generateSummary(content, { | ||
| summaryLength: config.summaryLength, | ||
| summaryPercentage: config.summaryPercentage | ||
| }) | ||
| record.set(config.summaryField, summary) | ||
|
|
||
| const keyPhrases = extractKeyPhrases(content) | ||
| record.set(config.keyPhrasesField, keyPhrases.join(', ')) | ||
|
||
| } | ||
|
|
||
| return record | ||
| }) | ||
| } | ||
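
The hook's branching (error on missing content, skip when a summary already exists, otherwise fill both output fields) can be exercised without a Flatfile environment. The sketch below uses a small stand-in record; `StubRecord`, `FieldConfig`, and `applySummary` are hypothetical names for this example and only imitate the `get`, `set`, and `addError` calls seen in the diff.

```typescript
// Stand-in for a record; this is NOT the FlatfileRecord class.
class StubRecord {
  private values: Map<string, string>
  readonly errors: Array<{ field: string; message: string }> = []

  constructor(initial: Array<[string, string]> = []) {
    this.values = new Map(initial)
  }
  get(field: string): string | undefined {
    return this.values.get(field)
  }
  set(field: string, value: string): void {
    this.values.set(field, value)
  }
  addError(field: string, message: string): void {
    this.errors.push({ field, message })
  }
}

interface FieldConfig {
  contentField: string
  summaryField: string
  keyPhrasesField: string
}

// Same decision flow as the hook body, with the text functions injected.
function applySummary(
  record: StubRecord,
  config: FieldConfig,
  summarizeText: (content: string) => string,
  extractPhrases: (content: string) => string[]
): StubRecord {
  const content = record.get(config.contentField)
  if (!content) {
    // Missing content is reported on the content field and nothing is written.
    record.addError(config.contentField, 'Content is required for summarization')
    return record
  }
  if (!record.get(config.summaryField)) {
    // Only fill the outputs when no summary exists yet.
    record.set(config.summaryField, summarizeText(content))
    record.set(config.keyPhrasesField, extractPhrases(content).join(', '))
  }
  return record
}
```

One consequence worth noting: because key phrases are written inside the same branch, a record that already carries a summary keeps its key-phrases field untouched as well.
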
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| import nlp from 'compromise' | ||
|
|
||
| export interface SummaryOptions { | ||
| summaryLength?: number | ||
| summaryPercentage?: number | ||
| } | ||
|
|
||
| export function generateSummary( | ||
| content: string, | ||
| options: SummaryOptions = {} | ||
| ): string { | ||
| const doc = nlp(content) | ||
| const sentences = doc.sentences().out('array') | ||
|
|
||
| let summaryLength = options.summaryLength || 2 | ||
| if (options.summaryPercentage) { | ||
| summaryLength = Math.max( | ||
| 1, | ||
| Math.floor((sentences.length * options.summaryPercentage) / 100) | ||
| ) | ||
| } | ||
|
|
||
| if (sentences.length <= summaryLength) { | ||
| return sentences.join(' ') | ||
| } | ||
|
|
||
| const middleIndex = Math.max(1, Math.floor(summaryLength / 2)) | ||
| const firstPart = sentences.slice(0, middleIndex).join(' ') | ||
| const lastPart = sentences.slice(-middleIndex).join(' ') | ||
| return `${firstPart} ... ${lastPart}` | ||
| } | ||
|
||
|
|
||
| export function extractKeyPhrases(content: string): string[] { | ||
| const doc = nlp(content) | ||
| // This line extracts key phrases from the content using compromise (nlp) | ||
| // It matches patterns of up to two optional adjectives followed by one or more nouns | ||
| // '#Adjective? #Adjective?' allows matching for up to two optional adjectives | ||
| // '#Noun+' matches one or more nouns | ||
| // The 'out('array')' method returns the matches as an array of strings | ||
| return doc.match('#Adjective? #Adjective? #Noun+').out('array') | ||
| } | ||
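
The head-and-tail selection in `generateSummary()` is easier to reason about on a plain array of sentences. The following is a minimal sketch under that assumption: `pickHeadAndTail` is a hypothetical name, and the input is already split into sentences rather than parsed by compromise. The sketch keeps at least one sentence on each side, because `slice(-0)` returns the entire array.

```typescript
// Hypothetical helper: takes pre-split sentences instead of raw text.
function pickHeadAndTail(sentences: string[], summaryLength: number): string {
  // Short inputs are returned whole.
  if (sentences.length <= summaryLength) {
    return sentences.join(' ')
  }
  // Half of the budget comes from the start and half from the end,
  // with a minimum of one sentence per side.
  const half = Math.max(1, Math.floor(summaryLength / 2))
  const head = sentences.slice(0, half).join(' ')
  const tail = sentences.slice(-half).join(' ')
  return `${head} ... ${tail}`
}

const five = ['First.', 'Second.', 'Third.', 'Fourth.', 'Fifth.']
const shortSummary = pickHeadAndTail(five, 3)
```

Because the budget is halved with `Math.floor`, an odd `summaryLength` such as 3 yields one sentence from each end, which is the behaviour the spec's "specified length" case expects.
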