Requesting pdf-parse and pdfjs-dist on Action function's node environment
These libraries are needed to process PDF content
39 Replies
Hey @Omar Farooq can you add details of the errors you get when using the packages in a Node.js action? (in case we can't repro the same use case)
Hello @Michal Srb Here it is
Just to double-check: have you run npm install pdf-parse? Is it in your package.json?
Overall, my dev knows to do this, but I'll double-check; maybe he missed it this time. Will let you know tomorrow. Thanks!
Yes, @Michal Srb here's the code for the action function
"use node";
import { v } from "convex/values";
import { action } from "./_generated/server";
import PDFParser from "pdf-parse";
const axios = require('axios');
const jsonSource = {
  data: [
    { "186": [[59.3087, 85.9676, 354.8682, 170.0326]] }
  ]
};
export const questionAnswer = action({
  args: {
    pdfId: v.string(),
  },
  handler: async (ctx: any, data: any) => {
    // const pdfPath = await ctx.storage.getUrl(data.pdfId)
    // console.log(pdfPath)
    let config = {
      method: 'get',
      maxBodyLength: Infinity,
      url: 'https://wandering-stork-982.convex.site/getImage?storageId=c9fc513c-2856-4402-ad1b-de3daa70b5a0',
    };
    // call for get pdf
    const dataPdf = await axios.request(config)
      .then((response: any) => {
        return JSON.stringify(response.data)
      })
      .catch((error: any) => {
        console.log(error, 'error');
      });
    console.log(dataPdf, 'dataPdf')
    // call for PDF read
    await axios.get(dataPdf, { responseType: 'arraybuffer' }).then(async (response: any) => {
      const pdfBuffer = Buffer.from(response.data);
      console.log(pdfBuffer)
      const pdfParser: any = PDFParser(pdfBuffer);
      pdfParser.parseBuffer(pdfBuffer);
      await new Promise(resolve => pdfParser.on('pdfParser_dataReady', resolve));
      for (const entry of jsonSource.data) {
        const [pageNumber, coordinates] = Object.entries(entry)[0];
        const text = pdfParser.getRawTextContent(parseInt(pageNumber));
        console.log(text, coordinates, 'text, coordinates')
      }
    }).catch((error: any) => {
      console.error('Error fetching PDF:', error);
    });
  },
});
And the error message is:
Error: Unable to push deployment config to https://wandering-stork-982.convex.cloud
400 Bad Request: InvalidModules: Loading the pushed modules encountered the following
error:
Failed to analyze pdfAction.js: ENOENT: no such file or directory, open './test/data/05-versions-space.pdf'
Hey @Omar Farooq, we're looking into this. The issue has to do with how we bundle the dependencies. You could work around it by using a fork of the package that replaces this line:
https://gitlab.com/autokent/pdf-parse/-/blob/master/index.js#L6
with
@Omar Farooq @presley figured out an easier solution, you can do:
because the index module that tries to load a file is just reexporting this nested module. That one should work!
Awesome thank you @Michal Srb, will try this and get back to you!
@Michal Srb Here's what we're seeing: 'Error:' "Error: Cannot find module './pdf.js/v1.10.100/build/pdf.js'\n" +
'Require stack:\n' +
'- /var/task/aws_lambda.cjs\n' +
'- /var/runtime/index.mjs'
@presley @Michal Srb Please let me know any update on this
Hey @Omar Farooq, that error comes from this line:
https://github.com/albertcui/pdf-parse/blob/master/lib/pdf-parse.js#L63C29-L63C80
We're not yet on a version of esbuild that would support this kind of dynamic require import.
I'd suggest you vendor in the code from the library (it looks pretty short) and require the mozilla dependency directly.
Hello @Michal Srb, thank you for providing that guidance. We are running into a 1 MB limit on the size of PDF that can be parsed. Can you confirm whether that is a limitation of action functions?
Hmm, I don't see this limit anywhere. Can you provide some more detail on how the PDF is getting into Convex? Is this an HTTP action that is calling a Node action? Are you passing the PDF as an argument? Is it getting stored in Convex as a document?
What is queryyy.com in this context? Is it some API you are calling?
Queryyy is our project in Convex. We're storing the PDF in the Convex file storage and parsing that file in our action function
I see. When are you hitting this error? Is it when you fetch or parse the pdf or when you return the response?
When we are attempting to parse it
If this is an error during parsing (and not while downloading or returning the response), I wonder if this is some limitation of the library. Can you confirm the error comes from the parse library? If so, this must be some library-specific error message. It seems weird, though.
I see, I will look into the issue being the library, thank you.
@presley @Michal Srb We're trying another way, would appreciate any guidance on this error, thank you! "[ERROR] No loader is configured for ".node" files: node_modules/canvas/build/Release/canvas.node
node_modules/canvas/lib/bindings.js:3:25:
3 │ const bindings = require('../build/Release/canvas.node')"
Hi @Omar Farooq, unfortunately, this is not a library we currently support, since we use esbuild as our bundler. We have a project to skip bundling some Node dependencies, but this is likely a few weeks away from shipping.
Hello @presley could this have an effect?
This would apply if you send the PDF as a query, mutation, or action argument or response. I need to check the exact limit, but this would make sense.
yes we are sending the pdf as an action arg
Ah, I see. It is best to use file storage https://docs.convex.dev/file-storage
It is stored on Convex already and we're referencing it
What is the code to download it? Is it using db.get or something else?
Looking above
I'm getting the code right now
Pastebin
"use node";import { v } from "convex/values";import { action } fro...
Ok, let me check that I understand this correctly. There is one action, getPdfText, that fetches the file from storage, splits it into chunks of roughly 100KB, encodes each chunk with base64 (which makes it larger) and calls https://www.queryyy.com/api/parse-pdf. And then the latter throws the error "Body exceeded 1mb limit"? If so, the root cause of the issue is that https://www.queryyy.com/api/parse-pdf throws an error when called with a body over 1mb. As far as I can tell, the request is passed over HTTP, and not as a Convex value, so the above limit shouldn't matter.
Now, how is https://www.queryyy.com/api/parse-pdf implemented? Is that another action built on Convex?
You are correct, looking into parse-pdf now.
This is it:
import pdfParse from 'pdf-parse';
import { NextApiResponse } from 'next';
export default async function handler(req, res) {
  if (req.method === 'POST') {
    const { file } = req.body;
    try {
      const pdfData = Buffer.from(file, 'base64');
      const parsed = await pdfParse(pdfData);
      res.status(200).json(parsed);
    } catch (error) {
      res.status(500).json({ error: 'An error occurred while parsing the PDF.' });
    }
  } else {
    res.status(405).json({ error: 'Method not allowed.' });
  }
}
Yeah, so this looks like a Next.js issue. It seems the default limit on the body size is 1mb. I think you can configure this: https://stackoverflow.com/questions/68574254/body-exceeded-1mb-limit-error-in-next-js-api-route
Separately, based on the code above, I am not sure how you end up with chunks that are over 1MB. Likely something in the code doesn't work, because chunks must be 100KB, which should be only about 133KB once base64 encoded (base64 adds roughly a third). I would add some print statements to check this logic as well.
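As a rough sanity check on the sizes being discussed, here is a minimal sketch of the arithmetic. It assumes standard padded base64 (every 3 input bytes become 4 output characters) and that "1mb" means 1 MiB; `base64Length` is a hypothetical helper, not something from the Pastebin code.

```typescript
// Length in characters of the padded base64 encoding of `byteLength` bytes:
// every group of 3 input bytes (the last group padded) becomes 4 characters.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

const chunkBytes = 100 * 1024; // one chunk of roughly 100KB
const limitBytes = 1024 * 1024; // assumption: "1mb" is 1 MiB

const encodedChars = base64Length(chunkBytes);

// Log the encoded size and whether a single encoded chunk fits under the limit.
console.log(encodedChars, encodedChars < limitBytes);
```

If logging the real chunk sizes in the action shows something much larger than this, the chunking logic is probably not splitting the file the way it is meant to.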
(I am not sure this is the exact way to fix the config for Next.js, but it seems like the limit is in there.)
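For reference, a minimal sketch of raising that limit. It assumes the endpoint is a Pages Router API route (a file under pages/api) that relies on the built-in Next.js body parser, which is what the handler above appears to be. The "4mb" value is only an illustration and does not come from this thread.

```typescript
// Next.js reads this named export from an API route file under pages/api.
// It sits next to the default-exported handler in the same file.
export const config = {
  api: {
    bodyParser: {
      sizeLimit: "4mb", // illustrative value; pick what your largest request needs
    },
  },
};
```

The hosting platform may impose its own request size cap on top of this setting, so it is worth checking that as well.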
Awesome thank you so much for looking into this for us @presley . We'll work on these points.
Separately, you can likely simplify the whole thing by passing the storage URL to the Next.js API endpoint and downloading the file from there.
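A minimal sketch of that idea, under stated assumptions: the request body would carry only the storage URL, and the endpoint would download the bytes itself before parsing. `downloadPdf` and `FetchLike` are hypothetical names introduced for illustration; the fetch function is passed in so the helper can be tried without a network.

```typescript
// Minimal shape of the fetch-like function this sketch needs (hypothetical type).
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  arrayBuffer(): Promise<ArrayBuffer>;
}>;

// Download the PDF bytes from a storage URL on the server side.
// Throws if the server answers with a non-2xx status.
async function downloadPdf(url: string, fetchImpl: FetchLike): Promise<Uint8Array> {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Failed to download PDF: HTTP ${response.status}`);
  }
  return new Uint8Array(await response.arrayBuffer());
}
```

Inside the API route, the handler would read the URL from the request body, call this helper with the runtime's fetch, and hand the resulting bytes (wrapped with Buffer.from) to pdfParse. Since only a short URL crosses the request, neither the base64 chunking nor the body size limit should come into play.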
No worries, sorry for the inconvenience. We are going to get pdf-parse and all remaining legacy npm libraries working (they currently don't because of an esbuild incompatibility with Node.js).
Hello!! Do we have any updates on getting the PDF parser working? I'm getting the same error. 😦
@Siraj You mean you are getting the Body exceeded 1mb limit error code too?
I put this into our backlog, so I can't offer any further help.
No, I'm getting this error. Even when I use forked pdf-parse from above.
Hey @Siraj did you try my suggestion above to vendor in the pdf-parse code and import the mozilla dependency directly (not using the dynamic require())?
Hello! Thanks for the information. Vendoring in the pdf-parse code and importing the mozilla dependency directly didn't work. I ran into trouble because pdf.js doesn't have an ES export, or maybe I did something wrong, but I finally sorted it out by using @bundled-es-modules/pdfjs-dist in my React app, extracting the text in the frontend and sending the text data to the backend.
For anyone who has a similar issue in the future, this should work 🙂
https://www.npmjs.com/package/@bundled-es-modules/pdfjs-dist
thanks for sharing a solution, @Siraj !
UPDATE: We have released a new beta feature in Convex 1.4 (https://news.convex.dev/announcing-convex-1-4/) that will allow us to skip bundling and install the packages on the server. We believe this will likely make pdf-parse work out of the box. More details on how to use the feature: https://docs.convex.dev/functions/bundling?ref=news.convex.dev#external-packages.
One detail is that you might need to use require("pdf-parse") instead of import "pdf-parse", since the package has an issue that makes it think it is running in test mode if you use import.