SpeechifyInc · emirotin · Nov 7, 2025 · Nov 1, 2025 · Nov 1, 2025 · Nov 1, 2025
diff --git a/.editorconfig b/.editorconfig
@@ -0,0 +1,12 @@
+# EditorConfig is awesome: https://EditorConfig.org
+
+# top-most EditorConfig file
+root = true
+
+[*]
+indent_style = space
+indent_size = 2
+end_of_line = lf
+charset = utf-8
+trim_trailing_whitespace = true
+insert_final_newline = true
diff --git a/.eslintrc.json b/.eslintrc.json
diff --git a/.github/workflows/test.yaml b/.github/workflows/test.yaml
@@ -0,0 +1,48 @@
+name: CI
+
+on:
+  push:
+    branches: [master]
+  pull_request:
+    branches: ['**']
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    env:
+      NPM_TOKEN: ${{ secrets.NPM_TOKEN }}
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Install pnpm
+        uses: pnpm/action-setup@v4
+        with:
+          version: 10.20.0
+          run_install: false
+
+      - name: Setup Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: 22.14.0
+          cache: pnpm
+
+      - name: Install system dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y poppler-utils antiword unrtf tesseract-ocr tesseract-ocr-chi-sim
+
+      - name: Install dependencies
+        run: pnpm install --frozen-lockfile
+
+      - name: Lint
+        run: pnpm lint
+
+      - name: Type check
+        run: pnpm typecheck
+
+      - name: Build
+        run: pnpm build
+
+      - name: Run tests
+        run: pnpm test
diff --git a/.gitignore b/.gitignore
@@ -14,8 +14,8 @@ results
 npm-debug.log
 
 node_modules
-package-lock.json
 
 .DS_Store
 
-ignore
+dist
+tsconfig.tsbuildinfo
diff --git a/.npmignore b/.npmignore
@@ -1,3 +1,2 @@
 node_modules
-test
-.vscode
+.vscode
diff --git a/.npmrc b/.npmrc
@@ -0,0 +1,2 @@
+//npm.pkg.github.com/:_authToken=${NPM_TOKEN}
+@speechifyinc:registry=https://npm.pkg.github.com
diff --git a/README.md b/README.md
@@ -1,36 +1,29 @@
-textract
-========
-
-A fork of text extraction node module with additional fixes.
-
-[![NPM](https://nodei.co/npm/textract.png?compact=true)](https://nodei.co/npm/textract/)
-[![NPM](https://nodei.co/npm-dl/textract.png)](https://nodei.co/npm/textract/)
+# textract
 
 ## Currently Extracts...
 
-* HTML, HTM
-* ATOM, RSS
-* Markdown
-* EPUB
-* XML, XSL
-* PDF
-* DOC, DOCX
-* ODT, OTT (experimental, feedback needed!)
-* RTF
-* XLS, XLSX, XLSB, XLSM, XLTX
-* CSV
-* ODS, OTS
-* PPTX, POTX
-* ODP, OTP
-* ODG, OTG
-* PNG, JPG, GIF
-* DXF
-* `application/javascript`
-* All `text/*` mime-types.
-
-In almost all cases above, what textract cares about is the mime type.  So `.html` and `.htm`, both possessing the same mime type, will be extracted.  Other extensions that share mime types with those above should also extract successfully. For example, `application/vnd.ms-excel` is the mime type for `.xls`, but also for 5 other file types.
-
-_Does textract not extract from files of the type you need?_  Add an issue or submit a pull request. It many cases textract is already capable, it is just not paying attention to the mime type you may be interested in.
+- HTML, HTM
+- ATOM, RSS
+- Markdown
+- EPUB
+- XML, XSL
+- PDF
+- DOC, DOCX
+- ODT, OTT (experimental)
+- RTF
+- XLS, XLSX, XLSB, XLSM, XLTX
+- CSV
+- ODS, OTS
+- PPTX, POTX
+- ODP, OTP
+- ODG, OTG
+- PNG, JPG, GIF
+- `application/javascript`
+- All `text/*` mime-types.
+
+In almost all cases above, what textract cares about is the mime type. So `.html` and `.htm`, both possessing the same mime type, will be extracted. Other extensions that share mime types with those above should also extract successfully. For example, `application/vnd.ms-excel` is the mime type for `.xls`, but also for 5 other file types.
+
+_Does textract not extract from files of the type you need?_ Add an issue or submit a pull request. In many cases textract is already capable; it is just not paying attention to the mime type you may be interested in.
 
 ## Install
 
@@ -40,131 +33,47 @@ npm install textract
 
 ## Extraction Requirements
 
-Note, if any of the requirements below are missing, textract will run and extract all files for types it is capable.  Not having these items installed does not prevent you from using textract, it just prevents you from extracting those specific files.
+Note: if any of the requirements below are missing, textract will still run and extract all file types it is capable of handling. Not having these items installed does not prevent you from using textract; it just prevents you from extracting those specific file types.
 
-* `PDF` extraction requires `pdftotext` be installed, [link](http://www.foolabs.com/xpdf/download.html)
-* `DOC` extraction requires `antiword` be installed, [link](http://www.winfield.demon.nl/), unless on OSX in which case textutil (installed by default) is used.
-* `RTF` extraction requires `unrtf` be installed, [link](https://www.gnu.org/software/unrtf/), unless on OSX in which case textutil (installed by default) is used.
-* `PNG`, `JPG` and `GIF` require `tesseract` to be available, [link](http://code.google.com/p/tesseract-ocr/).  Images need to be pretty clear, high DPI and made almost entirely of just text for `tesseract` to be able to accurately extract the text.
-* `DXF` extraction requires `drawingtotext` be available, [link](https://github.com/davidworkman9/drawingtotext)
+- `PDF` extraction requires `pdftotext` be installed, [link](http://www.foolabs.com/xpdf/download.html)
+- `DOC` extraction requires `antiword` be installed, [link](http://www.winfield.demon.nl/), unless on OSX in which case textutil (installed by default) is used.
+- `RTF` extraction requires `unrtf` be installed, [link](https://www.gnu.org/software/unrtf/), unless on OSX in which case textutil (installed by default) is used.
+- `PNG`, `JPG` and `GIF` require `tesseract` to be available, [link](http://code.google.com/p/tesseract-ocr/). Images need to be pretty clear, high DPI and made almost entirely of just text for `tesseract` to be able to accurately extract the text.
 
 ## Configuration
 
-Configuration can be passed into textract.  The following configuration options are available
+Configuration can be passed into textract. The following configuration options are available
 
-* `preserveLineBreaks`: When using the command line this is set to `true` to preserve stdout readability. When using the library via node this is set to `false`. Pass this in as `true` and textract will not strip any line breaks.
-* `preserveOnlyMultipleLineBreaks`: Some extractors, like PDF, insert line breaks at the end of every line, even if the middle of a sentence. If this option (default `false`) is set to `true`, then any instances of a single line break are removed but multiple line breaks are preserved. Check your output with this option, though, this doesn't preserve paragraphs unless there are multiple breaks.
-* `exec`: Some extractors (dxf) use node's `exec` functionality. This setting allows for providing [config to `exec` execution](http://nodejs.org/api/child_process.html#child_process_child_process_exec_command_options_callback). One reason you might want to provide this config is if you are dealing with very large files. You might want to increase the `exec` `maxBuffer` setting.
-* `[ext].exec`: Each extractor can take specific exec config. Keep in mind many extractors are responsible for extracting multiple types, so, for instance, the `odt` extractor is what you would configure for `odt` and `odg`/`odt` etc.  Check [the extractors](https://github.com/dbashford/textract/tree/master/lib/extractors) to see which you want to specifically configure. At the bottom of each is a list of `types` for which the extractor is responsible.
-* `tesseract.lang`: A pass-through to tesseract allowing for setting of language for extraction. ex: `{ tesseract: { lang:"chi_sim" } }`
-* `tesseract.cmd`: `tesseract.lang` allows a quick means to provide the most popular tesseract option, but if you need to configure more options, you can simply pass `cmd`. `cmd` is the string that matches the command-line options you want to pass to tesseract. For instance, to provide language and `psm`, you would pass `{ tesseract: { cmd:"-l chi_sim -psm 10" } }`
-* `pdftotextOptions`: This is a proxy options object to the library textract uses for pdf extraction: [pdf-text-extract](https://github.com/nisaacson/pdf-text-extract). Options include `ownerPassword`, `userPassword` if you are extracting text from password protected PDFs. IMPORTANT: textract modifies the pdf-text-extract `layout` default so that, instead of `layout: layout`, it uses `layout:raw`. It is not suggested you modify this without understanding what trouble that might get you in. See [this GH issue](https://github.com/dbashford/textract/issues/75) for why textract overrides that library's default.
-* `typeOverride`: Used with `fromUrl`, if set, rather than using the `content-type` from the URL request, will use the provided `typeOverride`.
-* `includeAltText`: When extracting HTML, whether or not to include `alt` text with the extracted text. By default this is `false`.
+- `preserveLineBreaks`: When using the command line this is set to `true` to preserve stdout readability. When using the library via node this is set to `false`. Pass this in as `true` and textract will not strip any line breaks.
+- `preserveOnlyMultipleLineBreaks`: Some extractors, like PDF, insert line breaks at the end of every line, even in the middle of a sentence. If this option (default `false`) is set to `true`, then any instances of a single line break are removed but multiple line breaks are preserved. Check your output with this option, though: it doesn't preserve paragraphs unless they are separated by multiple breaks.
+- `exec`: Some extractors (doc) use node's `exec` functionality. This setting allows for providing [config to `exec` execution](http://nodejs.org/api/child_process.html#child_process_child_process_exec_command_options_callback). One reason you might want to provide this config is if you are dealing with very large files. You might want to increase the `exec` `maxBuffer` setting.
+- `[ext].exec`: Each extractor can take specific exec config. Keep in mind many extractors are responsible for extracting multiple types, so, for instance, the `odt` extractor is what you would configure for `odt` and `odg`/`odt` etc. Check [the extractors](https://github.com/dbashford/textract/tree/master/lib/extractors) to see which you want to specifically configure. At the bottom of each is a list of `types` for which the extractor is responsible.
+- `tesseract.lang`: A pass-through to tesseract allowing for setting of language for extraction. ex: `{ tesseract: { lang:"chi_sim" } }`
+- `tesseract.cmd`: `tesseract.lang` allows a quick means to provide the most popular tesseract option, but if you need to configure more options, you can simply pass `cmd`. `cmd` is the string that matches the command-line options you want to pass to tesseract. For instance, to provide language and `psm`, you would pass `{ tesseract: { cmd:"-l chi_sim --psm 10" } }`
+- `pdftotextOptions`: This is a proxy options object to the library textract uses for pdf extraction: [pdf-text-extract](https://github.com/nisaacson/pdf-text-extract). Options include `ownerPassword`, `userPassword` if you are extracting text from password protected PDFs. IMPORTANT: textract modifies the pdf-text-extract `layout` default so that, instead of `layout: layout`, it uses `layout:raw`. It is not suggested you modify this without understanding what trouble that might get you in. See [this GH issue](https://github.com/dbashford/textract/issues/75) for why textract overrides that library's default.
+- `typeOverride`: Used with `fromUrl`, if set, rather than using the `content-type` from the URL request, will use the provided `typeOverride`.
+- `includeAltText`: When extracting HTML, whether or not to include `alt` text with the extracted text. By default this is `false`.
 
 To use this configuration at the command line, prefix each option with a `--`.
 
 Ex: `textract image.png --tesseract.lang=deu`
 
 ## Usage
 
-### Commmand Line
-
-If textract is installed gloablly, via `npm install -g textract`, then the following command will write the extracted text to the console for a file on the file system.
-
-```
-$ textract pathToFile
-```
-
-#### Flags
-
-Configuration flags can be passed into textract via the command line.
-
-```
-textract pathToFile --preserveLineBreaks false
-```
-
-Parameters like `exec.maxBuffer` can be passed as you'd expect.
-
-```
-textract pathToFile --exec.maxBuffer 500000
-```
-
-And multiple flags can be used together.
-
-```
-textract pathToFile --preserveLineBreaks false --exec.maxBuffer 500000
-```
-
-### Node
-
-#### Import
-
 ```javascript
-var textract = require('textract');
-```
-
-#### APIs
+import {extractFromBuffer, extractFromFile} from 'textract';
 
-There are several ways to extract text.  For all methods, the extracted text and an error object are passed to a callback.
+extractFromBuffer(contentBuffer, mimeType, options?);
 
-`error` will contain informative text about why the extraction failed. If textract does not currently extract files of the type provided, a `typeNotFound` flag will be tossed on the error object.
+// or
 
-##### File
-
-```javascript
-textract.fromFileWithPath(filePath, function( error, text ) {})
-```
-
-```javascript
-textract.fromFileWithPath(filePath, config, function( error, text ) {})
+extractFromFile("/path/to/file.docx", mimeType?, options?);
 ```
-##### File + mime type
 
-```javascript
-textract.fromFileWithMimeAndPath(type, filePath, function( error, text ) {})
-```
-
-```javascript
-textract.fromFileWithMimeAndPath(type, filePath, config, function( error, text ) {})
-```
-
-##### Buffer + mime type
-
-```javascript
-textract.fromBufferWithMime(type, buffer, function( error, text ) {})
-```
-
-```javascript
-textract.fromBufferWithMime(type, buffer, config, function( error, text ) {})
-```
-
-##### Buffer + file name/path
-
-```javascript
-textract.fromBufferWithName(name, buffer, function( error, text ) {})
-```
-
-```javascript
-textract.fromBufferWithName(name, buffer, config, function( error, text ) {})
-```
-
-##### URL
-
-When passing a URL, the URL can either be a string, or a [node.js URL object](https://nodejs.org/api/url.html). Using the URL object allows fine grained control over the URL being used.
+## Testing Notes
 
-```javascript
-textract.fromUrl(url, function( error, text ) {})
-```
+### Running on a Mac
 
-```javascript
-textract.fromUrl(url, config, function( error, text ) {})
-```
-
-## Testing Notes
+- `brew install tesseract tesseract-lang`
 
-### Running Tests on a Mac?
-- `sudo port install tesseract-chi-sim`
-- `sudo port install tesseract-eng`
-- You will also want to disable textract's usage of textutil as the tests are based on output from antiword.
-  - Go into `/lib/extractors/{doc|doc-osx|rtf}` and modify the code under `if ( os.platform() === 'darwin' ) {`. Uncommented the commented lines in these sections.
+NOTE! The Word processing results are inconsistent between OSX and Linux (different utilities are used), so the tests themselves are relaxed to accommodate both cases.
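
Reviewer note on the README's `preserveOnlyMultipleLineBreaks` option: the behavior it describes (single line breaks removed, runs of two or more preserved) can be sketched as below. This is a minimal illustration, not textract's actual implementation. The function name `collapseSingleLineBreaks` is hypothetical, and replacing a lone break with a space (rather than deleting it outright) is an assumption, since the README only says such breaks are "removed".

```typescript
// Hypothetical helper illustrating the documented behavior of
// `preserveOnlyMultipleLineBreaks`; not taken from the textract source.
function collapseSingleLineBreaks(text: string): string {
  return (
    text
      // Normalize Windows (\r\n) and old Mac (\r) line endings to "\n".
      .replace(/\r\n?/g, "\n")
      // A run of exactly one "\n" becomes a space (assumption: a space
      // keeps adjacent words apart); runs of two or more are kept as-is.
      .replace(/\n+/g, (run) => (run.length === 1 ? " " : run))
  );
}

// Example: a PDF-style hard wrap inside a sentence, then a paragraph break.
const sample = "This sentence was\nwrapped by the extractor.\n\nNew paragraph.";
console.log(collapseSingleLineBreaks(sample));
```

As the README warns, paragraphs survive only when the extractor emitted at least two consecutive breaks between them; a paragraph separated by a single break is merged into its neighbor.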
diff --git a/bin/textract b/bin/textract
diff --git a/eslint.config.mjs b/eslint.config.mjs
@@ -0,0 +1,22 @@
+import typescriptPreset from '@speechifyinc/platform-code-conformity-kit/eslint/presets/typescript-node.js';
+import prettierConfig from '@speechifyinc/platform-code-conformity-kit/eslint/configs/prettier.js';
+// import vitest from "@speechifyinc/platform-code-conformity-kit/eslint/configs/vitest.js";
+
+export default [
+  ...typescriptPreset,
+  {
+    languageOptions: {
+      parserOptions: {
+        project: ['./tsconfig.json'],
+      },
+    },
+  },
+  ...prettierConfig,
+  {
+    files: ['**/*.test.ts'],
+    rules: {
+      'n/no-unpublished-import': 'off',
+    },
+  },
+  { ignores: ['dist/**', 'test/files/**'] },
+];