liamgib/pdf-parse

 
 

pdf-parse

A pure TypeScript/JavaScript, cross-platform module for extracting text, images, and tabular data from PDF files.


Contributing Note: When opening an issue, please attach the relevant PDF file if possible. Providing the file will help us reproduce and resolve your issue more efficiently. For detailed guidelines on how to contribute, report bugs, or submit pull requests, see: contributing to pdf-parse

Features

  • Supports Node.js and browsers
  • CommonJS and ESM support
  • Vulnerability and security info: security policy
  • Extract page text: getText
  • Extract embedded images: getImage
  • Render pages as images: pageToImage
  • Detect and extract tabular data: getTable
  • For additional usage examples, check the example and test folders.

Similar Packages

Installation

npm install pdf-parse
# or
pnpm add pdf-parse
# or
yarn add pdf-parse
# or
bun add pdf-parse

Basic Usage

  • High-level helper for v1 compatibility: pdf
  • Full API: PDFParse

CommonJS Example, helper for v1 compatibility

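const pdf = require('pdf-parse');
// or
// const { pdf, PDFParse } = require('pdf-parse');
const fs = require('fs');

// Read the whole PDF into a Buffer.
const data = fs.readFileSync('test.pdf');

pdf(data)
    .then((result) => {
        console.log(result.text);
    })
    .catch((err) => {
        console.error('Failed to parse PDF:', err);
    });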

getText — Extract Text

// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile } from 'node:fs/promises';

const buffer = await readFile('test/test-01/test.pdf');

const parser = new PDFParse({ data: buffer });
const textResult = await parser.getText();
console.log(textResult.text);
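
The extracted text is an ordinary string, so it can be post-processed with plain JavaScript. The sketch below assumes only that `textResult.text` is a string, as in the example above. `summarizeText` is a hypothetical helper for illustration and is not part of pdf-parse.

```javascript
// Hypothetical helper (not part of pdf-parse): basic statistics for extracted text.
function summarizeText(text) {
    // Split on either newline style and drop empty or whitespace-only lines.
    const lines = text.split(/\r?\n/).filter((line) => line.trim() !== '');
    // Count whitespace-separated tokens across the whole text.
    const words = text.split(/\s+/).filter(Boolean);
    return {
        lineCount: lines.length,
        wordCount: words.length,
        charCount: text.length,
    };
}

// Usage with the result from the example above:
// console.log(summarizeText(textResult.text));
```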

For a complete list of configuration options, see:

Usage Examples

pageToImage — Render Page to PNG

// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile, writeFile } from 'node:fs/promises';

const buffer = await readFile('test/test-01/test.pdf');

const parser = new PDFParse({ data: buffer });
const result = await parser.pageToImage();

for (const pageData of result.pages) {
    const imgFileName = `page_${pageData.pageNumber}.png`;
    await writeFile(imgFileName, pageData.data, { flag: 'w' });
}
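
File names such as `page_2.png` and `page_10.png` do not sort in page order. A minimal sketch of zero-padded naming follows, assuming `pageData.pageNumber` is a positive integer as the loop above suggests. `pageFileName` and its parameters are hypothetical names, not part of pdf-parse.

```javascript
// Hypothetical helper: builds a sortable file name such as "page_007.png".
// `pageNumber` mirrors pageData.pageNumber from the loop above;
// `maxPageNumber` is the largest page number that will be written.
function pageFileName(pageNumber, maxPageNumber, extension = 'png') {
    // Pad to the width of the largest page number so names sort lexicographically.
    const width = String(maxPageNumber).length;
    return `page_${String(pageNumber).padStart(width, '0')}.${extension}`;
}

// Usage inside the loop above (maxPageNumber computed beforehand):
// await writeFile(pageFileName(pageData.pageNumber, maxPageNumber), pageData.data);
```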

getImage — Extract Embedded Images

// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile, writeFile } from 'node:fs/promises';

const buffer = await readFile('test/test-01/test.pdf');

const parser = new PDFParse({ data: buffer });
const result = await parser.getImage();

for (const pageData of result.pages) {
    for (const pageImage of pageData.images) {
        const imgFileName = `page_${pageData.pageNumber}-${pageImage.fileName}.png`;
        await writeFile(imgFileName, pageImage.data, { flag: 'w' });
    }
}
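
The nested loop above can be separated from the file writing, which makes the naming logic easier to check. This sketch assumes the result shape used in the example (`result.pages[].pageNumber` and `result.pages[].images[]` with `fileName` and `data`). `listImageOutputs` is a hypothetical helper, and the file-name sanitizing is an added precaution rather than something pdf-parse requires.

```javascript
// Hypothetical helper: flattens the nested result into { fileName, data } entries.
function listImageOutputs(result) {
    const outputs = [];
    for (const pageData of result.pages) {
        for (const pageImage of pageData.images) {
            // Replace characters that are unsafe in file names on common platforms.
            const safeName = String(pageImage.fileName).replace(/[^A-Za-z0-9._-]/g, '_');
            outputs.push({
                fileName: `page_${pageData.pageNumber}-${safeName}.png`,
                data: pageImage.data,
            });
        }
    }
    return outputs;
}

// Usage with the result from the example above:
// for (const { fileName, data } of listImageOutputs(result)) {
//     await writeFile(fileName, data);
// }
```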

getTable — Extract Tabular Data

// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile } from 'node:fs/promises';

const buffer = await readFile('test/test-01/test.pdf');

const parser = new PDFParse({ data: buffer });
const result = await parser.getTable();

for (const pageData of result.pages) {
    for (const table of pageData.tables) {
        console.log(table);
    }
}
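
Extracted tables are often exported as CSV. The sketch below assumes each `table` is an array of rows and each row is an array of cell values. That shape is an assumption, so log a table first to confirm it. `tableToCsv` is a hypothetical helper, not part of pdf-parse.

```javascript
// Hypothetical helper: serializes a table (assumed to be an array of row arrays) to CSV.
function tableToCsv(table) {
    return table
        .map((row) =>
            row
                .map((cell) => {
                    // Treat null and undefined cells as empty strings.
                    const value = String(cell ?? '');
                    // Quote cells containing quotes, commas, or line breaks,
                    // doubling any embedded quotes.
                    return /[",\r\n]/.test(value)
                        ? `"${value.replace(/"/g, '""')}"`
                        : value;
                })
                .join(',')
        )
        .join('\r\n');
}

// Usage inside the loop above:
// console.log(tableToCsv(table));
```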

Web / Browser

  • After running npm run build, you will find both regular and minified browser bundles in dist/browser (e.g., pdf-parse.es.js and pdf-parse.es.min.js).
  • See a minimal browser example in example/browser/pdf-parse.es.cdn.html.

Use the minified versions (.min.js) for production to reduce file size, or the regular versions for development and debugging.

You can use any of the following browser bundles depending on your module system and requirements:

  • pdf-parse.es.js or pdf-parse.es.min.js for ES modules
  • pdf-parse.umd.js or pdf-parse.umd.min.js for UMD/global usage

You can include the browser bundle directly from a CDN. Use the latest version:

Or specify a particular version:

Worker Note: In browser environments, the package sets pdfjs.GlobalWorkerOptions.workerSrc automatically when imported from the built browser bundle. If you use a custom build or host pdf.worker yourself, configure pdfjs accordingly.
