A pure TypeScript/JavaScript, cross-platform module for extracting text, images, and tabular data from PDF files.
Contributing Note: When opening an issue, please attach the relevant PDF file if possible. Providing the file will help us reproduce and resolve your issue more efficiently. For detailed guidelines on how to contribute, report bugs, or submit pull requests, see: contributing to pdf-parse
- Supports Node.js and browsers
- CommonJS and ESM support
- Vulnerability and security info: security policy
- Extract page text: getText
- Extract embedded images: getImage
- Render pages as images: pageToImage
- Detect and extract tabular data: getTable
- For additional usage examples, check the example and test folders.
- pdf2json — Buggy, memory leaks, uncatchable errors in some PDF files.
- j-pdfjson — Fork of pdf2json
- pdfreader — Uses pdf2json
- pdf-extract — Not cross-platform, depends on xpdf
npm install pdf-parse
# or
pnpm add pdf-parse
# or
yarn add pdf-parse
# or
bun add pdf-parse

const pdf = require('pdf-parse');
// or
// const { pdf, PDFParse } = require('pdf-parse');
const fs = require('fs');
const data = fs.readFileSync('test.pdf');
pdf(data).then(result => {
  console.log(result.text);
});

// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile } from 'node:fs/promises';
const buffer = await readFile('test/test-01/test.pdf');
const parser = new PDFParse({ data: buffer });
const textResult = await parser.getText();
console.log(textResult.text);

For a complete list of configuration options, see:
- DocumentInitParameters - PDF.js document initialization options
- ParseParameters - pdf-parse specific options
Usage Examples
- Parse password protected PDF: test/test-06-password
- Parse only specific pages: test/test-parse-parameters
- Parse embedded hyperlinks: test/test-hyperlinks
- Load PDF from URL: test/test-types
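As a rough illustration of the password and URL cases listed above, the sketch below builds the object passed to the PDFParse constructor. `buildInitParameters` is a hypothetical helper, not part of pdf-parse. It assumes the constructor forwards `url` and `password` to PDF.js, since both are PDF.js DocumentInitParameters; check the referenced tests for the exact usage.

```javascript
// Hypothetical helper (not part of pdf-parse): builds the options object
// for `new PDFParse(...)`.
// Assumption: `url` and `password` are accepted because they are
// PDF.js DocumentInitParameters.
function buildInitParameters(source, password) {
  // A string or URL object is treated as a remote location;
  // anything else (Buffer, Uint8Array, ...) is treated as raw PDF bytes.
  const isRemote = typeof source === 'string' || source instanceof URL;
  const params = isRemote ? { url: String(source) } : { data: source };
  // Only set a password when the caller supplied one.
  if (password !== undefined) {
    params.password = password;
  }
  return params;
}

// Usage sketch (requires pdf-parse; the URL and password are placeholders):
// const { PDFParse } = require('pdf-parse');
// const parser = new PDFParse(buildInitParameters('https://example.com/file.pdf', 'secret'));
// const result = await parser.getText();
// console.log(result.text);
```

Keeping the option-building step separate from parsing makes it easy to check which fields are sent to the parser before any PDF is loaded.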
// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile, writeFile } from 'node:fs/promises';
const buffer = await readFile('test/test-01/test.pdf');
const parser = new PDFParse({ data: buffer });
const result = await parser.pageToImage();
for (const pageData of result.pages) {
  const imgFileName = `page_${pageData.pageNumber}.png`;
  await writeFile(imgFileName, pageData.data, { flag: 'w' });
}

// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile, writeFile } from 'node:fs/promises';
const buffer = await readFile('test/test-01/test.pdf');
const parser = new PDFParse({ data: buffer });
const result = await parser.getImage();
for (const pageData of result.pages) {
  for (const pageImage of pageData.images) {
    const imgFileName = `page_${pageData.pageNumber}-${pageImage.fileName}.png`;
    await writeFile(imgFileName, pageImage.data, { flag: 'w' });
  }
}

// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile } from 'node:fs/promises';
const buffer = await readFile('test/test-01/test.pdf');
const parser = new PDFParse({ data: buffer });
const result = await parser.getTable();
for (const pageData of result.pages) {
  for (const table of pageData.tables) {
    console.log(table);
  }
}

- After running npm run build, you will find both regular and minified browser bundles in dist/browser (e.g., pdf-parse.es.js and pdf-parse.es.min.js).
- See a minimal browser example in example/browser/pdf-parse.es.cdn.html.
Use the minified versions (.min.js) for production to reduce file size, or the regular versions for development and debugging.
You can use any of the following browser bundles depending on your module system and requirements:
- pdf-parse.es.js or pdf-parse.es.min.js for ES modules
- pdf-parse.umd.js or pdf-parse.umd.min.js for UMD/global usage
You can include the browser bundle directly from a CDN. Use the latest version:
- https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/browser/pdf-parse.es.min.js
- https://unpkg.com/pdf-parse@latest/dist/browser/pdf-parse.es.min.js
Or specify a particular version:
- https://cdn.jsdelivr.net/npm/pdf-parse@2.1.10/dist/browser/pdf-parse.es.min.js
- https://unpkg.com/pdf-parse@2.1.10/dist/browser/pdf-parse.es.min.js
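The URLs above follow one pattern, which the sketch below captures. `bundleUrl` and its option names are hypothetical and exist only for illustration. The sketch also assumes that the UMD and non-minified files in dist/browser are served by the CDNs under the same path; only the es.min.js URLs are listed above.

```javascript
// Hypothetical helper (not part of pdf-parse): assembles a CDN URL for a
// browser bundle, following the URL pattern listed above.
const CDN_BASES = new Map([
  ['jsdelivr', 'https://cdn.jsdelivr.net/npm'],
  ['unpkg', 'https://unpkg.com'],
]);

function bundleUrl({ cdn = 'jsdelivr', version = 'latest', format = 'es', minified = true } = {}) {
  const base = CDN_BASES.get(cdn);
  if (base === undefined) {
    throw new Error(`Unknown CDN: ${cdn}`);
  }
  if (format !== 'es' && format !== 'umd') {
    throw new Error(`Unknown bundle format: ${format}`);
  }
  // Minified bundles end in .min.js, regular bundles in .js.
  const suffix = minified ? '.min.js' : '.js';
  return `${base}/pdf-parse@${version}/dist/browser/pdf-parse.${format}${suffix}`;
}

// Usage sketch in a browser ES module (version pinned for reproducibility):
// const { PDFParse } = await import(bundleUrl({ version: '2.1.10' }));
```

Pinning a specific version rather than using latest keeps a page from picking up a new release unexpectedly.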
Worker Note: In browser environments, the package sets pdfjs.GlobalWorkerOptions.workerSrc automatically when imported from the built browser bundle. If you use a custom build or host pdf.worker yourself, configure pdfjs accordingly.