Merged

V3 #2

81 commits
d7f2dbd
fix: do not run dir creation at the top of the utils script
emirotin Nov 1, 2025
931611e
version bump + changelog
emirotin Nov 1, 2025
e23ba32
Update yarn.lock to include integrity hashes for package resolutions
emirotin Nov 1, 2025
f95aeb7
Replace yarn.lock with pnpm-lock.yaml and remove package-lock.json fr…
emirotin Nov 1, 2025
08fead4
Add Volta configuration and specify package manager in package.json
emirotin Nov 1, 2025
9d32df7
convert url tests
emirotin Nov 1, 2025
39dd59e
convert buffer tests
emirotin Nov 1, 2025
0076d5b
convert cli tests
emirotin Nov 1, 2025
970b58a
convert general tests
emirotin Nov 1, 2025
0f25e27
convert extract tests
emirotin Nov 1, 2025
b4dda66
convert invalid calls tests
emirotin Nov 1, 2025
1ee442c
misc fixes
emirotin Nov 1, 2025
d898b9f
basic linting/formatting
emirotin Nov 1, 2025
052d9d3
tune config
emirotin Nov 1, 2025
8398fba
wip
emirotin Nov 3, 2025
6f4bf03
move to ESM
emirotin Nov 3, 2025
4ec1659
remove dynamic registration
emirotin Nov 3, 2025
58fcb27
Refactor extraction logic to TypeScript and update interfaces. Remove…
emirotin Nov 3, 2025
3a63c60
Refactor DOC extraction functions to improve error handling and strea…
emirotin Nov 3, 2025
6ecd924
Refactor extractor registration to use async/await for improved error…
emirotin Nov 3, 2025
3700c81
Convert DOCX extraction logic from JavaScript to TypeScript, enhancin…
emirotin Nov 4, 2025
bee3176
Refactor DXF extraction logic from JavaScript to TypeScript, implemen…
emirotin Nov 4, 2025
2722119
Convert EPUB extraction logic from JavaScript to TypeScript, implemen…
emirotin Nov 4, 2025
92178f2
Remove unused imports and constants from extract.js to streamline the…
emirotin Nov 4, 2025
0811c1d
Refactor HTML extraction logic by converting from JavaScript to TypeS…
emirotin Nov 4, 2025
511436e
Convert image extraction logic from JavaScript to TypeScript. Impleme…
emirotin Nov 4, 2025
c119300
Add @types/marked dependency and refactor Markdown extraction to Type…
emirotin Nov 4, 2025
29ba5e5
Convert ODT extraction logic from JavaScript to TypeScript, implement…
emirotin Nov 4, 2025
42873b7
Implement zip file unpacking utility and refactor DOCX and ODT extrac…
emirotin Nov 4, 2025
013ef57
Add XLS extraction logic in TypeScript
emirotin Nov 4, 2025
a02e373
Convert text extraction logic from JavaScript to TypeScript
emirotin Nov 4, 2025
5263ec7
Convert PDF extraction logic from JavaScript to TypeScript
emirotin Nov 4, 2025
6591df3
Convert PPTX extraction logic from JavaScript to TypeScript
emirotin Nov 4, 2025
1b59e00
Refactor RTF extraction logic from JavaScript to TypeScript
emirotin Nov 4, 2025
374a74c
Refactor extractors to TypeScript and enhance type definitions
emirotin Nov 4, 2025
b9c760d
Add @types/html-entities dependency and implement extract function in…
emirotin Nov 4, 2025
ef47992
Refactor extract.ts to TypeScript and enhance function signatures
emirotin Nov 5, 2025
70fe3be
Remove CLI scripts and related files from the project
emirotin Nov 5, 2025
f2a0c9d
Refactor extraction logic and update dependencies
emirotin Nov 5, 2025
ec0975d
Refactor extraction functions and update TypeScript definitions
emirotin Nov 5, 2025
86c84b8
Enhance TypeScript definitions by making options parameter optional
emirotin Nov 5, 2025
9dd7c1a
Update tests for text extraction to correct expected output
emirotin Nov 5, 2025
adc5287
Update tests to correct expected output and enhance error message ass…
emirotin Nov 5, 2025
97a2051
Update extract tests to reflect corrected expected output
emirotin Nov 5, 2025
91f6257
update epub dep, fix
emirotin Nov 5, 2025
f19866b
fix test
emirotin Nov 5, 2025
af67902
update marked
emirotin Nov 5, 2025
1bddcc0
update html-entities
emirotin Nov 5, 2025
9532619
Update mime dependency to version 4.1.0 and adjust test for mime type…
emirotin Nov 5, 2025
6994d3a
misc lint
emirotin Nov 5, 2025
0752460
fix lint
emirotin Nov 5, 2025
789e1a9
Refactor extractor functions to require options parameter
emirotin Nov 5, 2025
d138c9e
slightly update pdf to text command
emirotin Nov 5, 2025
572d70b
Update pdf-text-extract dependency to version 1.5.0 and add command n…
emirotin Nov 5, 2025
ee338de
vendor pdf-text-extract
emirotin Nov 5, 2025
a85820f
Upgrade jschardet dependency to version 3.1.4 and refactor import sta…
emirotin Nov 5, 2025
66cd22a
Upgrade iconv-lite dependency to version 0.7.0 in package.json and pn…
emirotin Nov 5, 2025
c0409a4
fix .doc tests
emirotin Nov 6, 2025
fdd61d5
Update tesseract command options in README and types for consistency
emirotin Nov 6, 2025
cec086d
Merge branch 'master' into v3
emirotin Nov 6, 2025
0aba784
GHA
emirotin Nov 6, 2025
543b660
Update GitHub Actions workflow to restrict push events to the master …
emirotin Nov 6, 2025
7ad3cbf
Update .gitignore to remove .npmrc entry and add .npmrc file for GitH…
emirotin Nov 6, 2025
0a3d049
Add NPM_TOKEN environment variable to GitHub Actions workflow for pac…
emirotin Nov 6, 2025
139db59
Refactor antiword execution in DOC extractor to use `-h` flag for imp…
emirotin Nov 6, 2025
d385dec
Refactor DOC tests to account for OS-specific behavior in text extrac…
emirotin Nov 6, 2025
74e2bc9
Remove DXF extractor and related tests; update README and types for c…
emirotin Nov 6, 2025
bf85ba4
Update GitHub Actions workflow to include Chinese Simplified language…
emirotin Nov 6, 2025
786e2bb
Update text extraction tests for RTF files to use `toContain` for imp…
emirotin Nov 6, 2025
ae2c36f
Replace legacy manual XML manipulations with dedicated OpenDoc module…
emirotin Nov 6, 2025
a7bca16
update readme
emirotin Nov 6, 2025
fb16786
Merge branch 'v3' of github.com:SpeechifyInc/textract into v3
emirotin Nov 6, 2025
5058daf
Update build configuration and dependencies; modify .gitignore and .n…
emirotin Nov 6, 2025
a161ab5
Update ESLint configuration, package scripts, and .gitignore
emirotin Nov 6, 2025
4ef769b
reduce buffer/file/buffer conversions when not necessary
emirotin Nov 6, 2025
0c60a41
implement lazy initialization
emirotin Nov 6, 2025
a88f7a1
Refactor extraction functions and update package configuration
emirotin Nov 6, 2025
1fa40ba
Add linting, type checking, and build steps to GitHub Actions workflow
emirotin Nov 6, 2025
70fd5be
Remove outdated buffered extract test and enhance pdf extract test as…
emirotin Nov 6, 2025
f60a434
simplify epub code
emirotin Nov 6, 2025
94df06f
return eslint-specific fixes
emirotin Nov 6, 2025
12 changes: 12 additions & 0 deletions .editorconfig
@@ -0,0 +1,12 @@
# EditorConfig is awesome: https://EditorConfig.org

# top-most EditorConfig file
root = true

[*]
indent_style = space
indent_size = 2
end_of_line = lf
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
27 changes: 0 additions & 27 deletions .eslintrc.json

This file was deleted.

48 changes: 48 additions & 0 deletions .github/workflows/test.yaml
@@ -0,0 +1,48 @@
name: CI

on:
  push:
    branches: [master]
  pull_request:
    branches: ['**']

jobs:
  test:
    runs-on: ubuntu-latest
    env:
      NPM_TOKEN: ${{ secrets.NPM_TOKEN }}
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Install pnpm
        uses: pnpm/action-setup@v4
        with:
          version: 10.20.0
          run_install: false

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 22.14.0
          cache: pnpm

      - name: Install system dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y poppler-utils antiword unrtf tesseract-ocr tesseract-ocr-chi-sim

      - name: Install dependencies
        run: pnpm install --frozen-lockfile

      - name: Lint
        run: pnpm lint

      - name: Type check
        run: pnpm typecheck

      - name: Build
        run: pnpm build

      - name: Run tests
        run: pnpm test
4 changes: 2 additions & 2 deletions .gitignore
@@ -14,8 +14,8 @@ results
npm-debug.log

node_modules
package-lock.json

.DS_Store

ignore
dist
tsconfig.tsbuildinfo
3 changes: 1 addition & 2 deletions .npmignore
@@ -1,3 +1,2 @@
node_modules
test
.vscode
.vscode
2 changes: 2 additions & 0 deletions .npmrc
@@ -0,0 +1,2 @@
//npm.pkg.github.com/:_authToken=${NPM_TOKEN}
@speechifyinc:registry=https://npm.pkg.github.com
183 changes: 46 additions & 137 deletions README.md
@@ -1,36 +1,29 @@
textract
========

A fork of text extraction node module with additional fixes.

[![NPM](https://nodei.co/npm/textract.png?compact=true)](https://nodei.co/npm/textract/)
[![NPM](https://nodei.co/npm-dl/textract.png)](https://nodei.co/npm/textract/)
# textract

## Currently Extracts...

* HTML, HTM
* ATOM, RSS
* Markdown
* EPUB
* XML, XSL
* PDF
* DOC, DOCX
* ODT, OTT (experimental, feedback needed!)
* RTF
* XLS, XLSX, XLSB, XLSM, XLTX
* CSV
* ODS, OTS
* PPTX, POTX
* ODP, OTP
* ODG, OTG
* PNG, JPG, GIF
* DXF
* `application/javascript`
* All `text/*` mime-types.

In almost all cases above, what textract cares about is the mime type. So `.html` and `.htm`, both possessing the same mime type, will be extracted. Other extensions that share mime types with those above should also extract successfully. For example, `application/vnd.ms-excel` is the mime type for `.xls`, but also for 5 other file types.

_Does textract not extract from files of the type you need?_ Add an issue or submit a pull request. In many cases textract is already capable; it is just not paying attention to the mime type you may be interested in.
- HTML, HTM
- ATOM, RSS
- Markdown
- EPUB
- XML, XSL
- PDF
- DOC, DOCX
- ODT, OTT (experimental)
- RTF
- XLS, XLSX, XLSB, XLSM, XLTX
- CSV
- ODS, OTS
- PPTX, POTX
- ODP, OTP
- ODG, OTG
- PNG, JPG, GIF
- `application/javascript`
- All `text/*` mime-types.

In almost all cases above, what textract cares about is the mime type. So `.html` and `.htm`, both possessing the same mime type, will be extracted. Other extensions that share mime types with those above should also extract successfully. For example, `application/vnd.ms-excel` is the mime type for `.xls`, but also for 5 other file types.

_Does textract not extract from files of the type you need?_ Add an issue or submit a pull request. In many cases textract is already capable; it is just not paying attention to the mime type you may be interested in.
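
As a rough illustration of the mime-type point above, the sketch below keys extractors by mime type rather than by file extension, so `.html` and `.htm` resolve to the same one. It is not textract's actual code: `register`, `findExtractor`, `mimeForExtension`, and the small extension table are all names assumed for this example.

```typescript
// Hypothetical sketch: none of these names come from textract itself.
type Extractor = (content: string) => string;

// Extractors are keyed by mime type, not by file extension.
const extractorsByMime = new Map<string, Extractor>();

function register(mimeTypes: string[], extractor: Extractor): void {
  for (const mimeType of mimeTypes) {
    extractorsByMime.set(mimeType, extractor);
  }
}

// Illustrative extension table; a real project would use a mime library.
const extensionToMime: Record<string, string> = {
  html: 'text/html',
  htm: 'text/html',
  xls: 'application/vnd.ms-excel',
};

function mimeForExtension(extension: string): string | undefined {
  return extensionToMime[extension.toLowerCase()];
}

function findExtractor(mimeType: string): Extractor | undefined {
  const exact = extractorsByMime.get(mimeType);
  if (exact !== undefined) return exact;
  // Fall back to a catch-all entry for any other text/* type.
  return mimeType.startsWith('text/') ? extractorsByMime.get('text/*') : undefined;
}

// Crude tag stripper, only to make the example observable.
register(['text/html'], (content) => content.replace(/<[^>]*>/g, ''));
register(['text/*'], (content) => content);
```

Under these assumptions, supporting a new file type is a matter of registering its mime type; an unregistered type outside `text/*` simply finds no extractor.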

## Install

@@ -40,131 +33,47 @@ npm install textract

## Extraction Requirements

Note, if any of the requirements below are missing, textract will run and extract all files for types it is capable. Not having these items installed does not prevent you from using textract, it just prevents you from extracting those specific files.
Note, if any of the requirements below are missing, textract will run and extract all files for types it is capable. Not having these items installed does not prevent you from using textract, it just prevents you from extracting those specific files.

* `PDF` extraction requires `pdftotext` be installed, [link](http://www.foolabs.com/xpdf/download.html)
* `DOC` extraction requires `antiword` be installed, [link](http://www.winfield.demon.nl/), unless on OSX in which case textutil (installed by default) is used.
* `RTF` extraction requires `unrtf` be installed, [link](https://www.gnu.org/software/unrtf/), unless on OSX in which case textutil (installed by default) is used.
* `PNG`, `JPG` and `GIF` require `tesseract` to be available, [link](http://code.google.com/p/tesseract-ocr/). Images need to be pretty clear, high DPI and made almost entirely of just text for `tesseract` to be able to accurately extract the text.
* `DXF` extraction requires `drawingtotext` be available, [link](https://github.com/davidworkman9/drawingtotext)
- `PDF` extraction requires `pdftotext` be installed, [link](http://www.foolabs.com/xpdf/download.html)
- `DOC` extraction requires `antiword` be installed, [link](http://www.winfield.demon.nl/), unless on OSX in which case textutil (installed by default) is used.
- `RTF` extraction requires `unrtf` be installed, [link](https://www.gnu.org/software/unrtf/), unless on OSX in which case textutil (installed by default) is used.
- `PNG`, `JPG` and `GIF` require `tesseract` to be available, [link](http://code.google.com/p/tesseract-ocr/). Images need to be pretty clear, high DPI and made almost entirely of just text for `tesseract` to be able to accurately extract the text.

## Configuration

Configuration can be passed into textract. The following configuration options are available
Configuration can be passed into textract. The following configuration options are available

* `preserveLineBreaks`: When using the command line this is set to `true` to preserve stdout readability. When using the library via node this is set to `false`. Pass this in as `true` and textract will not strip any line breaks.
* `preserveOnlyMultipleLineBreaks`: Some extractors, like PDF, insert line breaks at the end of every line, even if the middle of a sentence. If this option (default `false`) is set to `true`, then any instances of a single line break are removed but multiple line breaks are preserved. Check your output with this option, though, this doesn't preserve paragraphs unless there are multiple breaks.
* `exec`: Some extractors (dxf) use node's `exec` functionality. This setting allows for providing [config to `exec` execution](http://nodejs.org/api/child_process.html#child_process_child_process_exec_command_options_callback). One reason you might want to provide this config is if you are dealing with very large files. You might want to increase the `exec` `maxBuffer` setting.
* `[ext].exec`: Each extractor can take specific exec config. Keep in mind many extractors are responsible for extracting multiple types, so, for instance, the `odt` extractor is what you would configure for `odt` and `odg`/`odt` etc. Check [the extractors](https://github.com/dbashford/textract/tree/master/lib/extractors) to see which you want to specifically configure. At the bottom of each is a list of `types` for which the extractor is responsible.
* `tesseract.lang`: A pass-through to tesseract allowing for setting of language for extraction. ex: `{ tesseract: { lang:"chi_sim" } }`
* `tesseract.cmd`: `tesseract.lang` allows a quick means to provide the most popular tesseract option, but if you need to configure more options, you can simply pass `cmd`. `cmd` is the string that matches the command-line options you want to pass to tesseract. For instance, to provide language and `psm`, you would pass `{ tesseract: { cmd:"-l chi_sim -psm 10" } }`
* `pdftotextOptions`: This is a proxy options object to the library textract uses for pdf extraction: [pdf-text-extract](https://github.com/nisaacson/pdf-text-extract). Options include `ownerPassword`, `userPassword` if you are extracting text from password protected PDFs. IMPORTANT: textract modifies the pdf-text-extract `layout` default so that, instead of `layout: layout`, it uses `layout:raw`. It is not suggested you modify this without understanding what trouble that might get you in. See [this GH issue](https://github.com/dbashford/textract/issues/75) for why textract overrides that library's default.
* `typeOverride`: Used with `fromUrl`, if set, rather than using the `content-type` from the URL request, will use the provided `typeOverride`.
* `includeAltText`: When extracting HTML, whether or not to include `alt` text with the extracted text. By default this is `false`.
- `preserveLineBreaks`: When using the command line this is set to `true` to preserve stdout readability. When using the library via node this is set to `false`. Pass this in as `true` and textract will not strip any line breaks.
- `preserveOnlyMultipleLineBreaks`: Some extractors, like PDF, insert line breaks at the end of every line, even if the middle of a sentence. If this option (default `false`) is set to `true`, then any instances of a single line break are removed but multiple line breaks are preserved. Check your output with this option, though, this doesn't preserve paragraphs unless there are multiple breaks.
- `exec`: Some extractors (doc) use node's `exec` functionality. This setting allows for providing [config to `exec` execution](http://nodejs.org/api/child_process.html#child_process_child_process_exec_command_options_callback). One reason you might want to provide this config is if you are dealing with very large files. You might want to increase the `exec` `maxBuffer` setting.
- `[ext].exec`: Each extractor can take specific exec config. Keep in mind many extractors are responsible for extracting multiple types, so, for instance, the `odt` extractor is what you would configure for `odt` and `odg`/`odt` etc. Check [the extractors](https://github.com/dbashford/textract/tree/master/lib/extractors) to see which you want to specifically configure. At the bottom of each is a list of `types` for which the extractor is responsible.
- `tesseract.lang`: A pass-through to tesseract allowing for setting of language for extraction. ex: `{ tesseract: { lang:"chi_sim" } }`
- `tesseract.cmd`: `tesseract.lang` allows a quick means to provide the most popular tesseract option, but if you need to configure more options, you can simply pass `cmd`. `cmd` is the string that matches the command-line options you want to pass to tesseract. For instance, to provide language and `psm`, you would pass `{ tesseract: { cmd:"-l chi_sim --psm 10" } }`
- `pdftotextOptions`: This is a proxy options object to the library textract uses for pdf extraction: [pdf-text-extract](https://github.com/nisaacson/pdf-text-extract). Options include `ownerPassword`, `userPassword` if you are extracting text from password protected PDFs. IMPORTANT: textract modifies the pdf-text-extract `layout` default so that, instead of `layout: layout`, it uses `layout:raw`. It is not suggested you modify this without understanding what trouble that might get you in. See [this GH issue](https://github.com/dbashford/textract/issues/75) for why textract overrides that library's default.
- `typeOverride`: Used with `fromUrl`, if set, rather than using the `content-type` from the URL request, will use the provided `typeOverride`.
- `includeAltText`: When extracting HTML, whether or not to include `alt` text with the extracted text. By default this is `false`.
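
To make the `preserveOnlyMultipleLineBreaks` description concrete, here is a minimal sketch of the behavior it describes: lone line breaks are collapsed while runs of two or more are kept. The helper name `collapseSingleLineBreaks` is hypothetical, and replacing the break with a space (rather than deleting it) is an assumption; textract's own implementation may differ.

```typescript
// Hypothetical helper for illustration; textract's own implementation may differ.
function collapseSingleLineBreaks(text: string): string {
  // Match a "\n" that is neither preceded nor followed by another "\n".
  // The captured preceding character is put back, followed by a space.
  // Using a space instead of nothing is an assumption, chosen so that
  // words on adjacent lines do not run together.
  return text.replace(/(^|[^\n])\n(?!\n)/g, '$1 ');
}

// PDF-style output: a sentence broken mid-line, then a paragraph break.
const pdfLikeOutput = 'A sentence split\nacross two lines.\n\nA new paragraph.';
const collapsed = collapseSingleLineBreaks(pdfLikeOutput);
```

Note that only the double break survives, which matches the caveat above: paragraphs separated by a single break would be merged.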

To use this configuration at the command line, prefix each option with a `--`.

Ex: `textract image.png --tesseract.lang=deu`

## Usage

### Command Line

If textract is installed globally, via `npm install -g textract`, then the following command will write the extracted text to the console for a file on the file system.

```
$ textract pathToFile
```

#### Flags

Configuration flags can be passed into textract via the command line.

```
textract pathToFile --preserveLineBreaks false
```

Parameters like `exec.maxBuffer` can be passed as you'd expect.

```
textract pathToFile --exec.maxBuffer 500000
```

And multiple flags can be used together.

```
textract pathToFile --preserveLineBreaks false --exec.maxBuffer 500000
```

### Node

#### Import

```javascript
var textract = require('textract');
```

#### APIs
import {extract} from 'textract';

There are several ways to extract text. For all methods, the extracted text and an error object are passed to a callback.
extractFromBuffer(contentBuffer, mimeType, options?);

`error` will contain informative text about why the extraction failed. If textract does not currently extract files of the type provided, a `typeNotFound` flag will be tossed on the error object.
// or

##### File

```javascript
textract.fromFileWithPath(filePath, function( error, text ) {})
```

```javascript
textract.fromFileWithPath(filePath, config, function( error, text ) {})
extractFromFile("/path/to/file.docx", mimeType?, options?);
```
##### File + mime type

```javascript
textract.fromFileWithMimeAndPath(type, filePath, function( error, text ) {})
```

```javascript
textract.fromFileWithMimeAndPath(type, filePath, config, function( error, text ) {})
```

##### Buffer + mime type

```javascript
textract.fromBufferWithMime(type, buffer, function( error, text ) {})
```

```javascript
textract.fromBufferWithMime(type, buffer, config, function( error, text ) {})
```

##### Buffer + file name/path

```javascript
textract.fromBufferWithName(name, buffer, function( error, text ) {})
```

```javascript
textract.fromBufferWithName(name, buffer, config, function( error, text ) {})
```

##### URL

When passing a URL, the URL can either be a string, or a [node.js URL object](https://nodejs.org/api/url.html). Using the URL object allows fine grained control over the URL being used.
## Testing Notes

```javascript
textract.fromUrl(url, function( error, text ) {})
```
### Running on a Mac

```javascript
textract.fromUrl(url, config, function( error, text ) {})
```

## Testing Notes
- `brew install tesseract tesseract-lang`

### Running Tests on a Mac?
- `sudo port install tesseract-chi-sim`
- `sudo port install tesseract-eng`
- You will also want to disable textract's usage of textutil as the tests are based on output from antiword.
- Go into `/lib/extractors/{doc|doc-osx|rtf}` and modify the code under `if ( os.platform() === 'darwin' ) {`. Uncomment the commented lines in these sections.
NOTE! The Word processing results are inconsistent between OSX and Linux (different utils are used), so the tests themselves are relaxed to accommodate both cases.
32 changes: 0 additions & 32 deletions bin/textract

This file was deleted.

22 changes: 22 additions & 0 deletions eslint.config.mjs
@@ -0,0 +1,22 @@
import typescriptPreset from '@speechifyinc/platform-code-conformity-kit/eslint/presets/typescript-node.js';
import prettierConfig from '@speechifyinc/platform-code-conformity-kit/eslint/configs/prettier.js';
// import vitest from "@speechifyinc/platform-code-conformity-kit/eslint/configs/vitest.js";

export default [
  ...typescriptPreset,
  {
    languageOptions: {
      parserOptions: {
        project: ['./tsconfig.json'],
      },
    },
  },
  ...prettierConfig,
  {
    files: ['**/*.test.ts'],
    rules: {
      'n/no-unpublished-import': 'off',
    },
  },
  { ignores: ['dist/**', 'test/files/**'] },
];