Omar Farooq
Omar Farooq17mo ago

Requesting pdf-parse and pdfjs-dist on Action function's node environment

These libraries are needed to process PDF content
39 Replies
Michal Srb
Michal Srb17mo ago
Hey @Omar Farooq can you add details of the errors you get when using the packages in a Node.js action? (in case we can't repro the same use case)
Omar Farooq
Omar FarooqOP17mo ago
Hello @Michal Srb Here it is
[Attached image]
Michal Srb
Michal Srb17mo ago
Just to double-check: have you run npm install pdf-parse? Is it in your package.json?
Omar Farooq
Omar FarooqOP17mo ago
Overall, my dev knows to do this, but I'll double check, maybe he missed it on this. Will let you know tomorrow. Thanks!

Yes, @Michal Srb here's the code for the action function:

    "use node";

    import { v } from "convex/values";
    import { action } from "./_generated/server";
    import PDFParser from "pdf-parse";
    const axios = require('axios');

    const jsonSource = {
      data: [
        { "186": [[59.3087, 85.9676, 354.8682, 170.0326]] }
      ]
    };

    export const questionAnswer = action({
      args: {
        pdfId: v.string(),
      },
      handler: async (ctx: any, data: any) => {

        // const pdfPath = await ctx.storage.getUrl(data.pdfId)
        // console.log(pdfPath)

        let config = {
          method: 'get',
          maxBodyLength: Infinity,
          url: 'https://wandering-stork-982.convex.site/getImage?storageId=c9fc513c-2856-4402-ad1b-de3daa70b5a0',
        };

        // call for get pdf
        const dataPdf = await axios.request(config)
          .then((response: any) => {
            return JSON.stringify(response.data)
          })
          .catch((error: any) => {
            console.log(error, 'error');
          });

        console.log(dataPdf, 'dataPdf')

        // call for PDF read
        await axios.get(dataPdf, { responseType: 'arraybuffer' }).then(async (response: any) => {
          const pdfBuffer = Buffer.from(response.data);

          console.log(pdfBuffer)

          const pdfParser: any = PDFParser(pdfBuffer);

          pdfParser.parseBuffer(pdfBuffer);

          await new Promise(resolve => pdfParser.on('pdfParser_dataReady', resolve));

          for (const entry of jsonSource.data) {
            const [pageNumber, coordinates] = Object.entries(entry)[0];
            const text = pdfParser.getRawTextContent(parseInt(pageNumber));

            console.log(text, coordinates, 'text, coordinates')
          }
        }).catch((error: any) => {
          console.error('Error fetching PDF:', error);
        });

      },
    });

And the error message is:

    Error: Unable to push deployment config to https://wandering-stork-982.convex.cloud
    400 Bad Request: InvalidModules: Loading the pushed modules encountered the following error:
    Failed to analyze pdfAction.js: ENOENT: no such file or directory, open './test/data/05-versions-space.pdf'
Michal Srb
Michal Srb17mo ago
Hey @Omar Farooq, we're looking into this. The issue has to do with how we bundle the dependencies. You could work around it by using a fork of the package that replaces this line: https://gitlab.com/autokent/pdf-parse/-/blob/master/index.js#L6 with
let isDebugMode = false;
Michal Srb
Michal Srb17mo ago
@Omar Farooq @presley figured out an easier solution, you can do:
let PDFParser = require("pdf-parse/lib/pdf-parse");
because the index module that tries to load a file is just reexporting this nested module. That one should work!
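To illustrate why the top-level module fails after bundling: the line linked above appears to set the debug flag from `module.parent`, so the test-file read runs whenever the module seems to have no parent, which is presumably how it looks once bundled. The sketch below only models that check. `ModuleLike` and `wouldRunDebugBranch` are hypothetical names, not part of pdf-parse, and the exact form of the check is an assumption.

```typescript
// Hypothetical stand-in for the part of Node's CommonJS `module` object that matters here.
type ModuleLike = { parent?: object | null };

// Models the assumed check `let isDebugMode = !module.parent;` in pdf-parse's index.js.
function wouldRunDebugBranch(mod: ModuleLike): boolean {
  return !mod.parent;
}

// Loaded through a normal require(): a parent module exists, so the debug branch is skipped.
const requiredNormally = wouldRunDebugBranch({ parent: { id: "app.js" } });

// Inlined into a single bundle (assumption): no parent is visible, so the debug branch
// runs and tries to read './test/data/05-versions-space.pdf'.
const inlinedByBundler = wouldRunDebugBranch({});

console.log({ requiredNormally, inlinedByBundler });
```

Requiring the nested `lib/pdf-parse` module sidesteps this because that file contains no such debug branch.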
Omar Farooq
Omar FarooqOP17mo ago
Awesome thank you @Michal Srb, will try this and get back to you!

@Michal Srb Here's what we're seeing:

    'Error:' "Error: Cannot find module './pdf.js/v1.10.100/build/pdf.js'\n" +
      'Require stack:\n' +
      '- /var/task/aws_lambda.cjs\n' +
      '- /var/runtime/index.mjs'

@presley @Michal Srb Please let me know any update on this
Michal Srb
Michal Srb17mo ago
Hey @Omar Farooq, that error comes from this line: https://github.com/albertcui/pdf-parse/blob/master/lib/pdf-parse.js#L63C29-L63C80 We're not yet on a version of esbuild that would support this kind of dynamic require. I'd suggest you vendor in the code from the library (it looks pretty short) and require the Mozilla dependency directly.
Omar Farooq
Omar FarooqOP17mo ago
Hello @Michal Srb, thank you for providing that guidance. We are running into a 1 MB limit on the size of the PDF that can be parsed. Can you confirm whether that is a limitation of action functions?
[Attached image]
presley
presley17mo ago
Hmm.. I don't see this limit anywhere. Can you provide some more detail on how the PDF is getting into Convex? Is this an HTTP action that is calling a Node action? Are you passing the PDF as an argument? Is this getting stored in Convex as a document? What is queryyy.com in this context? Is it some API you are calling?
Omar Farooq
Omar FarooqOP17mo ago
Queryyy is our project in Convex. We're storing the PDF in the Convex file storage and parsing that file in our action function
presley
presley17mo ago
I see. When are you hitting this error? Is it when you fetch or parse the pdf or when you return the response?
Omar Farooq
Omar FarooqOP17mo ago
When we are attempting to parse it
presley
presley17mo ago
If this is an error during parsing (and not while downloading or returning the response), I wonder if this is some limitation of the library. Can you confirm the error comes from the parse library? If so, this must be some library-specific error message. It seems weird though.
Omar Farooq
Omar FarooqOP17mo ago
I see, I will look into the issue being the library, thank you.

@presley @Michal Srb We're trying another way, would appreciate any guidance on this error, thank you!

    "[ERROR] No loader is configured for ".node" files: node_modules/canvas/build/Release/canvas.node
        node_modules/canvas/lib/bindings.js:3:25:
          3 │ const bindings = require('../build/Release/canvas.node')"
presley
presley17mo ago
Hi @Omar Farooq, unfortunately, this is not a library we currently support since we use esbuild as the bundler. We have a project to skip bundling some Node dependencies, but this is likely a few weeks away from shipping.
Omar Farooq
Omar FarooqOP17mo ago
Hello @presley could this have an effect?
Omar Farooq
Omar FarooqOP17mo ago
[Attached image]
presley
presley17mo ago
This would apply if you send the pdf as query, mutation or action argument or response. I need to check the exact limit but this would make sense.
Omar Farooq
Omar FarooqOP17mo ago
yes we are sending the pdf as an action arg
presley
presley17mo ago
Ah, I see. It is best to use file storage https://docs.convex.dev/file-storage
Omar Farooq
Omar FarooqOP17mo ago
It is stored on Convex already and we're referencing it
presley
presley17mo ago
What is the code to download it? Is it using db.get or something else? Looking above
Omar Farooq
Omar FarooqOP17mo ago
I'm getting the code right now
Omar Farooq
Omar FarooqOP17mo ago
Pastebin
"use node";​import { v } from "convex/values";import { action } fro...
presley
presley17mo ago
Ok, if I understand this correctly: there is one action, getPdfText, that fetches the file from storage, splits it into chunks of roughly 100KB, encodes each chunk with base64 (which makes it larger) and calls https://www.queryyy.com/api/parse-pdf. And then the latter throws the error Body exceeded 1mb limit? If so, the root cause of the issue is that https://www.queryyy.com/api/parse-pdf throws an error when called with a body over 1mb. As far as I can tell, the request is passed over HTTP, not as a Convex value, so the above limit shouldn't matter. Now, how is https://www.queryyy.com/api/parse-pdf implemented? Is that another action built on Convex?
Omar Farooq
Omar FarooqOP17mo ago
You are correct, looking into parse-pdf now. This is it:

    import pdfParse from 'pdf-parse';
    import { NextApiResponse } from 'next';

    export default async function handler(req, res) {
      if (req.method === 'POST') {
        const { file } = req.body;
        try {
          const pdfData = Buffer.from(file, 'base64');
          const parsed = await pdfParse(pdfData);
          res.status(200).json(parsed);
        } catch (error) {
          res.status(500).json({ error: 'An error occurred while parsing the PDF.' });
        }
      } else {
        res.status(405).json({ error: 'Method not allowed.' });
      }
    }
presley
presley17mo ago
Yeah, so this looks like a Next.js issue. It seems the default limit on the body size is 1mb. I think you can configure this: https://stackoverflow.com/questions/68574254/body-exceeded-1mb-limit-error-in-next-js-api-route
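For an API route like the handler posted above, Next.js reads a `config` export from the route file, and the body parser's `sizeLimit` defaults to 1mb. A minimal sketch, assuming parse-pdf is a Pages Router API route; the `4mb` value is only an illustration and would need to fit whatever the hosting platform allows:

```typescript
// Added to the same file as the parse-pdf handler (assumed to be a Pages Router API route).
// Next.js picks up this named export and applies it to the route's built-in body parser.
export const config = {
  api: {
    bodyParser: {
      // Illustrative value only; the default is "1mb".
      sizeLimit: "4mb",
    },
  },
};
```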
presley
presley17mo ago
Separately, based on the code above, I am not sure how you end up with chunks that are over 1MB. Likely something in the code doesn't work, because chunks should be 100KB, which when base64-encoded should be roughly 133KB? I would add some print statements to check this logic as well. (I am not sure this is the exact way to fix the config for Next.js, but it seems like the limit is in there.)
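Base64 turns every 3 bytes into 4 characters, so an encoded chunk is about a third larger than the raw bytes. A quick check of the numbers, assuming the chunks really are about 100KB and taking the 1mb limit as 1024 * 1024 bytes (`base64Length` is a helper made up for this sketch):

```typescript
// Length of the base64 text for `byteLength` bytes, including `=` padding:
// every group of up to 3 input bytes becomes 4 output characters.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

const chunkBytes = 100 * 1024; // the ~100KB chunk size discussed above
const encodedChars = base64Length(chunkBytes);

// Assumed interpretation of the "1mb" in the error message.
const limitBytes = 1024 * 1024;

console.log(encodedChars);              // 136536
console.log(encodedChars < limitBytes); // true
```

If the request bodies logged in the action are much larger than this, the chunking logic is probably not splitting the file as intended.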
Omar Farooq
Omar FarooqOP17mo ago
Awesome thank you so much for looking into this for us @presley . We'll work on these points.
presley
presley17mo ago
Separately, you can likely simplify the whole thing by passing the storage URL to the Next.js API endpoint and downloading the file from there. No worries, sorry for the inconvenience, we are going to get pdf-parse and all remaining legacy npm libraries working (they currently don't because of esbuild's incompatibility with Node.js).
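A rough sketch of that simplification, assuming the endpoint receives the storage URL in the request body instead of the base64 chunks. `downloadPdf` and `FetchLike` are hypothetical names; the fetch function is passed in as a parameter so the helper can be tried without a network.

```typescript
// Minimal shape of the fetch function and response this sketch relies on (hypothetical type).
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  arrayBuffer(): Promise<ArrayBufferLike>;
}>;

// Downloads the file behind `storageUrl` and returns its raw bytes.
async function downloadPdf(storageUrl: string, fetchImpl: FetchLike): Promise<Uint8Array> {
  const response = await fetchImpl(storageUrl);
  if (!response.ok) {
    throw new Error(`Download failed with status ${response.status}`);
  }
  return new Uint8Array(await response.arrayBuffer());
}
```

Inside the API route this could be called with the global `fetch` and the URL taken from the request body, with the resulting bytes wrapped in `Buffer.from(...)` before they are handed to pdfParse. Only the short URL travels in the request body, so the body size limit no longer applies to the PDF itself.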
Siraj
Siraj15mo ago
Hello!! Do we have any updates on getting PDF parser working? I'm getting same error. 😦
Omar Farooq
Omar FarooqOP15mo ago
@Siraj You mean you are getting the Body exceeded 1mb limit error code too? I put this into our backlog, so I can't offer any further help.
Siraj
Siraj15mo ago
No, I'm getting this error. Even when I use forked pdf-parse from above.
Michal Srb
Michal Srb15mo ago
Hey @Siraj did you try based on my suggestion above to vendor in the pdf-parse code and import the mozilla dependency directly? (not using the dynamic require())
Siraj
Siraj15mo ago
Hello! Thanks for the information. Vendoring in the pdf-parse code and importing the Mozilla dependency directly didn't work. I ran into trouble because pdf.js has no ES module export, or maybe I did something wrong, but I finally sorted it out by using @bundled-es-modules/pdfjs-dist in my React app, extracting the text in the frontend and sending the text data to the backend.
Siraj
Siraj15mo ago
For anyone who has a similar issue in the future, this should work 🙂 https://www.npmjs.com/package/@bundled-es-modules/pdfjs-dist
jamwt
jamwt15mo ago
thanks for sharing a solution, @Siraj !
presley
presley15mo ago
UPDATE: We have released a new beta feature in Convex 1.4 (https://news.convex.dev/announcing-convex-1-4/) that will allow us to skip bundling and install the packages on the server. We believe this will likely get pdf-parse working out of the box. More details on how to use the feature: https://docs.convex.dev/functions/bundling?ref=news.convex.dev#external-packages. One detail is that you might need to use require("pdf-parse") instead of import "pdf-parse", since the package has an issue that makes it think it is running in test mode if you use import.