Is there a lightweight library for Next.js to extract/view the text content of a PDF and preview it on screen (and maybe allow editing it)? It doesn't need to be saved back to the file. We only care about the text in the PDF: extracting it and editing it, as it will be sent to the backend. The file itself is irrelevant.
ChatGPT said:
For pure text extraction and in-browser preview/editing, your best bet is still Mozilla's PDF.js (via the lighter pdfjs-dist package) rather than a full "PDF viewer" component. Here's how you can wire it up in Next.js:

1. Client-side extraction with pdfjs-dist

Why: you can import just the core and worker, call getTextContent() on each page, and render only the text, with no canvas or heavy UI layers required. (stackoverflow.com)

How (high-level):

Install

```bash
npm install pdfjs-dist
```

Dynamic import (to avoid SSR breakage in Next.js)

```js
// next/dynamic wraps React components, not plain libraries, so load
// pdf.js with a native dynamic import() that only runs on the client.
const loadPdfjs = () => import('pdfjs-dist/legacy/build/pdf');
```

Extract text in a React effect:

```js
useEffect(() => {
  if (!fileArrayBuffer) return;
  (async () => {
    // Loaded inside the effect, so this never executes during SSR.
    const pdfjsLib = await loadPdfjs();
    // Note: pdf.js also expects GlobalWorkerOptions.workerSrc to be set before this call.
    const loadingTask = pdfjsLib.getDocument({ data: fileArrayBuffer });
    const pdf = await loadingTask.promise;
    let fullText = '';
    // pdf.js pages are 1-indexed.
    for (let i = 1; i <= pdf.numPages; i++) {
      const page = await pdf.getPage(i);
      const { items } = await page.getTextContent();
      fullText += items.map(item => item.str).join(' ') + '\n\n';
    }
    setText(fullText);
  })();
}, [fileArrayBuffer]);
```

Preview & edit

```jsx
<textarea
  value={text}
  onChange={e => setText(e.target.value)}
  style={{ width: '100%', height: '60vh' }}
/>
```

Send your edited text back to the backend however you like.

2. Server-side extraction alternatives

If you prefer not to bundle any PDF code on the client, you can extract in a Next.js API route and ship only text to your UI:

pdf-parse (Node.js): a small wrapper around PDF.js for server use; extracts text and metadata. (medium.com)
pdf.js-extract: a wrapper packaged from a PDF.js Node example that gives you structured page/text data. (npmjs.com)

You'd upload the PDF to an API route, use one of these to grab the text (pdf-parse exposes it as .text), then return that to your front end and display it in the same <textarea> setup.

Why not a full "React PDF Viewer"?

Components like @react-pdf-viewer or react-pdf include canvas-based renderers and UI layers for annotations, search, thumbnails, etc., which is quite heavy if all you need is text. By using just the extraction APIs you keep your bundle small and avoid unnecessary UI code.

In summary:

On-client: go with pdfjs-dist + getTextContent(), load it with a dynamic import in Next.js, render text in a <textarea>.
On-server: use pdf-parse or pdf.js-extract in an API route and return plain text for client-side editing.

Can you create a POC of this?

I'll create a proof of concept for a lightweight PDF text extractor and editor using Next.js and pdfjs-dist. This solution will:
Text Extraction: When the "Extract Text" button is clicked, the app:
To use this in your project:
Install the required dependencies:
npm install pdfjs-dist@legacy
Replace the placeholder in handleSendToBackend() with your actual API endpoint.
This approach uses the getTextContent() method, which is more lightweight than rendering the PDF visually.

The error you're encountering is related to how the PDF.js worker is being loaded. Let's fix this by properly configuring the worker and ensuring it's loaded correctly.
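As a rough sketch of what that worker configuration could look like: pdf.js reads the worker location from GlobalWorkerOptions.workerSrc, and the worker script generally has to match the version of the API build you loaded. The helper names (workerUrlFor, configureWorker) and the unpkg URL pattern below are assumptions for illustration, not part of the original POC; the worker file name also varies between pdfjs-dist releases, so check your installed version.

```javascript
// Hypothetical helper: builds a worker URL for a given pdfjs-dist version.
// The CDN host and file name are assumptions; newer releases may ship
// the worker under a different file name, or you may self-host it.
function workerUrlFor(version) {
  return `https://unpkg.com/pdfjs-dist@${version}/build/pdf.worker.min.js`;
}

// Hypothetical helper: points an already-imported pdf.js module at a
// worker script matching its own version. Call it once, before getDocument().
function configureWorker(pdfjsLib) {
  if (!pdfjsLib || !pdfjsLib.GlobalWorkerOptions) {
    throw new Error('Expected a pdf.js module exposing GlobalWorkerOptions');
  }
  // `version` and `GlobalWorkerOptions.workerSrc` are exported by pdf.js.
  pdfjsLib.GlobalWorkerOptions.workerSrc = workerUrlFor(pdfjsLib.version);
  return pdfjsLib.GlobalWorkerOptions.workerSrc;
}
```

You would call configureWorker(pdfjsLib) right after the dynamic import resolves on the client. Deriving the URL from pdfjsLib.version is one way to avoid the API and worker versions drifting apart after an upgrade.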
I've made the following changes to fix the worker loading issue:
Updated PDF.js Imports:
Worker Configuration:
Error Handling:
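For the error-handling part, one possible sketch is to translate failures from the loading task into messages you can show next to the upload control. This assumes the rejection carries a name such as PasswordException or InvalidPDFException, which pdf.js uses for encrypted and malformed files; the function name describePdfError and the message wording are hypothetical.

```javascript
// Hypothetical helper: maps an error thrown while loading a PDF to a
// user-facing message. Only two pdf.js exception names are handled
// explicitly; everything else falls through to a generic message.
function describePdfError(err) {
  const name = err && err.name;
  switch (name) {
    case 'PasswordException':
      return 'This PDF is password-protected and cannot be read.';
    case 'InvalidPDFException':
      return 'The file does not appear to be a valid PDF.';
    default: {
      const detail = err && err.message ? err.message : 'unknown error';
      return `Could not extract text from the PDF: ${detail}`;
    }
  }
}
```

In the component, you could wrap the extraction loop in try/catch and store describePdfError(err) in state, so the user sees a readable message instead of a silent failure.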
If you're still encountering issues, try these additional steps:
Check Dependencies:
npm install pdfjs-dist@latest
If you continue to face issues, please let me know and I can provide additional solutions or alternative approaches to extract text from PDFs.