Omar•2y ago

Requesting pdf-parse and pdfjs-dist on Action function's node environment

These libraries are needed to process PDF content

39 Replies

Hey @Omar Farooq can you add details of the errors you get when using the packages in a Node.js action? (in case we can't repro the same use case)

OmarOP•2y ago

Hello @Michal Srb Here it is

Michal Srb•2y ago

Just to double-check: have you run npm install pdf-parse? Is it in your package.json?

OmarOP•2y ago

Overall, my dev knows to do this, but I'll double-check, maybe he missed it this time. Will let you know tomorrow. Thanks! Yes, @Michal Srb here's the code for the action function

 "use node";

import { v } from "convex/values";
import { action } from "./_generated/server";
import PDFParser from "pdf-parse";
const axios = require('axios');

const jsonSource = {
    data: [
        { "186": [[59.3087, 85.9676, 354.8682, 170.0326]] }
    ]
};

export const questionAnswer = action({
    args: {
        pdfId: v.string(),
    },
    handler: async (ctx: any, data: any) => {

        // const pdfPath = await ctx.storage.getUrl(data.pdfId)
        // console.log(pdfPath)

        let config = {
            method: 'get',
            maxBodyLength: Infinity,
            url: 'https://wandering-stork-982.convex.site/getImage?storageId=c9fc513c-2856-4402-ad1b-de3daa70b5a0',
        };

        // call for get pdf
        const dataPdf = await axios.request(config)
            .then((response: any) => {
                return JSON.stringify(response.data)
            })
            .catch((error: any) => {
                console.log(error, 'error');
            });

        console.log(dataPdf, 'dataPdf')

        // call for PDF read
        await axios.get(dataPdf, { responseType: 'arraybuffer' }).then(async (response: any) => {
            const pdfBuffer = Buffer.from(response.data);

            console.log(pdfBuffer)

            const pdfParser: any = PDFParser(pdfBuffer);

            pdfParser.parseBuffer(pdfBuffer);

            await new Promise(resolve => pdfParser.on('pdfParser_dataReady', resolve));

            for (const entry of jsonSource.data) {
                const [pageNumber, coordinates] = Object.entries(entry)[0];
                const text = pdfParser.getRawTextContent(parseInt(pageNumber));

                console.log(text, coordinates, 'text, coordinates')

            }
        }).catch((error: any) => {
            console.error('Error fetching PDF:', error);
        });

    },
});

And the error message is: Error: Unable to push deployment config to https://wandering-stork-982.convex.cloud 400 Bad Request: InvalidModules: Loading the pushed modules encountered the following error: Failed to analyze pdfAction.js: ENOENT: no such file or directory, open './test/data/05-versions-space.pdf'

Michal Srb•2y ago

Hey @Omar Farooq, we're looking into this. The issue has to do with how we bundle the dependencies. You could work around it by using a fork of the package that replaces this line: https://gitlab.com/autokent/pdf-parse/-/blob/master/index.js#L6 with

let isDebugMode = false;


Michal Srb•2y ago

@Omar Farooq @presley figured out an easier solution, you can do:

let PDFParser = require("pdf-parse/lib/pdf-parse");

because the index module that tries to load a file is just reexporting this nested module. That one should work!
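A minimal sketch of how the nested module could then be called, for reference. As far as I know, pdf-parse exports a single function that takes a buffer and returns a promise of a result object with fields such as `text` and `numpages`; the `parseBuffer` / `pdfParser_dataReady` / `getRawTextContent` calls in the action above look like the API of a different library (pdf2json). The parser is injected here so the sketch runs without the package installed; `PdfParseFn`, `extractPdfText`, and `fakeParse` are hypothetical names, not part of pdf-parse or Convex.

```typescript
// Assumed subset of the result object that pdf-parse resolves with.
type PdfParseResult = { text: string; numpages: number };
// Shape of the function exported by pdf-parse (a Node Buffer is a Uint8Array).
type PdfParseFn = (data: Uint8Array) => Promise<PdfParseResult>;

// Hypothetical helper: parse a PDF buffer and return trimmed text and page count.
async function extractPdfText(
  parse: PdfParseFn,
  pdfBytes: Uint8Array,
): Promise<{ text: string; pages: number }> {
  if (pdfBytes.length === 0) {
    throw new Error("empty PDF buffer");
  }
  const result = await parse(pdfBytes);
  return { text: result.text.trim(), pages: result.numpages };
}

// Stand-in parser so the sketch runs on its own; it does not read real PDFs.
const fakeParse: PdfParseFn = async (data) => ({
  text: `  parsed ${data.length} bytes  `,
  numpages: 1,
});
```

In a Node action, the real parser would come from require("pdf-parse/lib/pdf-parse") as suggested above and be passed in place of `fakeParse`.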

OmarOP•2y ago

Awesome, thank you @Michal Srb, will try this and get back to you!

@Michal Srb Here's what we're seeing:

Error: Cannot find module './pdf.js/v1.10.100/build/pdf.js'
Require stack:
- /var/task/aws_lambda.cjs
- /var/runtime/index.mjs

@presley @Michal Srb Please let me know of any update on this

Michal Srb•2y ago

Hey @Omar Farooq, that error comes from this line: https://github.com/albertcui/pdf-parse/blob/master/lib/pdf-parse.js#L63C29-L63C80 We're not yet on a version of esbuild that would support this kind of dynamic require. I'd suggest you vendor in the code from the library (it looks pretty short) and require the Mozilla dependency directly.


OmarOP•2y ago

Hello @Michal Srb, thank you for providing that guidance. We are running into a 1 MB limit on the size of the PDF that can be parsed. Can you confirm whether that is a limitation of action functions?

presley•2y ago

Hmm.. I don't see this limit anywhere. Can you provide some more detail on how the PDF is getting into Convex? Is this an HTTP action that is calling a Node action? Are you passing the PDF as an argument? Is this getting stored in Convex as a document? What is queryyy.com in this context? Is it some API you are calling?

OmarOP•2y ago

Queryyy is our project in Convex. We're storing the PDF in the Convex file storage and parsing that file in our action function

presley•2y ago

I see. When are you hitting this error? Is it when you fetch or parse the pdf or when you return the response?

OmarOP•2y ago

When we are attempting to parse it

presley•2y ago

If this is an error during parsing (and not while downloading or returning the response), I wonder if this is some limitation of the library. Can you confirm the error comes from the parse library? If so, this must be some library-specific error message. It seems weird though.

OmarOP•2y ago

I see, I will look into whether the issue is the library, thank you.

@presley @Michal Srb We're trying another way, would appreciate any guidance on this error, thank you!

[ERROR] No loader is configured for ".node" files: node_modules/canvas/build/Release/canvas.node
node_modules/canvas/lib/bindings.js:3:25:
3 │ const bindings = require('../build/Release/canvas.node')

presley•2y ago

Hi @Omar Farooq, unfortunately, this is not a library we currently support, since we use esbuild as our bundler. We have a project to skip bundling some Node dependencies, but this is likely a few weeks away from shipping.

OmarOP•2y ago

Hello @presley could this have an effect?

OmarOP•2y ago

presley•2y ago

This would apply if you send the pdf as query, mutation or action argument or response. I need to check the exact limit but this would make sense.

OmarOP•2y ago

yes we are sending the pdf as an action arg

presley•2y ago

Ah, I see. It is best to use file storage https://docs.convex.dev/file-storage


OmarOP•2y ago

It is stored on Convex already and we're referencing it

presley•2y ago

What is the code to download it? Is it using db.get or something else? Looking above

OmarOP•2y ago

I'm getting the code right now

OmarOP•2y ago

https://pastebin.com/EXXPw6By


presley•2y ago

OK, if I understand this correctly: there is one action, getPdfText, that fetches the file from storage, splits it into chunks of roughly 100KB, base64-encodes each chunk (which makes it larger), and calls https://www.queryyy.com/api/parse-pdf. And then the latter throws the error Body exceeded 1mb limit? If so, the root cause of the issue is that https://www.queryyy.com/api/parse-pdf throws an error when called with a body over 1mb. As far as I can tell, the request is passed over HTTP and not as a Convex value, so the above limit shouldn't matter. Now, how is https://www.queryyy.com/api/parse-pdf implemented? Is that another action built on Convex?

OmarOP•2y ago

You are correct, looking into parse-pdf now. This is it:

import pdfParse from 'pdf-parse';
import { NextApiResponse } from 'next';

export default async function handler(req, res) {
    if (req.method === 'POST') {
        const { file } = req.body;
        try {
            const pdfData = Buffer.from(file, 'base64');
            const parsed = await pdfParse(pdfData);
            res.status(200).json(parsed);
        } catch (error) {
            res.status(500).json({ error: 'An error occurred while parsing the PDF.' });
        }
    } else {
        res.status(405).json({ error: 'Method not allowed.' });
    }
}

presley•2y ago

Yeah, so this looks like a Next.js issue. It seems the default limit on the body size is 1mb. I think you can configure this: https://stackoverflow.com/questions/68574254/body-exceeded-1mb-limit-error-in-next-js-api-route
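For reference, Next.js API routes (pages router) support a per-route `config` export that can raise the body parser limit. A minimal sketch, assuming it is placed in the same file as the parse-pdf handler; the "4mb" value is an arbitrary illustration, not a recommendation, and the hosting platform may impose its own request size cap.

```typescript
// Per-route configuration for a Next.js API route (pages router).
// The sizeLimit value is illustrative only; pick one that fits the PDFs involved.
export const config = {
  api: {
    bodyParser: {
      sizeLimit: "4mb",
    },
  },
};
```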


presley•2y ago

Separately, based on the code above, I am not sure how you end up with chunks that are over 1MB. Likely something in the code doesn't work, because chunks should be 100KB, which when you base64 encode should be 400KB? I would add some print statements to check this logic as well. (I am not sure this is the exact way to fix the config for Next.js, but it seems like the limit is in there.)
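One way to sanity-check the chunk sizes: padded base64 turns every 3 input bytes into 4 output characters, so the encoded size is about 4/3 of the input, and a 100 KiB chunk would come out near 133 KiB, well under 1 MB. The sketch below assumes fixed-size chunking as described above; `base64EncodedLength`, `splitIntoChunks`, and the 250 KiB placeholder file are hypothetical and not taken from the pastebin.

```typescript
// Length in characters of the padded base64 encoding of `byteLength` bytes:
// every group of up to 3 input bytes becomes 4 output characters.
function base64EncodedLength(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// Split a byte array into consecutive chunks of at most `chunkSize` bytes.
function splitIntoChunks(bytes: Uint8Array, chunkSize: number): Uint8Array[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  const chunks: Uint8Array[] = [];
  for (let offset = 0; offset < bytes.length; offset += chunkSize) {
    // subarray creates a view, so no bytes are copied here.
    chunks.push(bytes.subarray(offset, offset + chunkSize));
  }
  return chunks;
}

const CHUNK_SIZE = 100 * 1024; // 100 KiB, assumed from the description above
const pdfBytes = new Uint8Array(250 * 1024); // placeholder standing in for a 250 KiB file
const chunks = splitIntoChunks(pdfBytes, CHUNK_SIZE);
const encodedSizes = chunks.map((chunk) => base64EncodedLength(chunk.length));

// Log the per-chunk encoded sizes to compare against the 1 MB body limit.
console.log(encodedSizes);
```

If the logged sizes are far above what this arithmetic predicts, the chunking or the request assembly is probably sending more than one chunk per request.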

OmarOP•2y ago

Awesome thank you so much for looking into this for us @presley . We'll work on these points.

presley•2y ago

Separately, you can likely simplify the whole thing by passing the storage URL to the Next.js API endpoint and downloading the file from there. No worries, sorry for the inconvenience. We are going to get pdf-parse and all remaining legacy npm libraries working (they currently don't because of esbuild's incompatibility with some Node.js packages).
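A rough sketch of what that simplification could look like on the endpoint side: the request body carries only a URL, and the endpoint downloads and parses the file itself, so the body stays tiny regardless of PDF size. The downloader and parser are injected so the sketch runs on its own; `parsePdfFromUrl`, `Downloader`, and `Parser` are hypothetical names, and wiring them to fetch and pdf-parse is left as an assumption.

```typescript
// Hypothetical dependencies, injected so the sketch is self-contained.
type Downloader = (url: string) => Promise<Uint8Array>;
type Parser = (data: Uint8Array) => Promise<{ text: string }>;

type EndpointResult = {
  status: number;
  json: { text?: string; error?: string };
};

// Endpoint logic: accept only a URL, download the PDF server-side, then parse it.
async function parsePdfFromUrl(
  body: { url?: unknown },
  download: Downloader,
  parse: Parser,
): Promise<EndpointResult> {
  const url = body.url;
  if (typeof url !== "string" || !/^https:\/\//.test(url)) {
    return { status: 400, json: { error: "Expected an https `url` field." } };
  }
  try {
    const bytes = await download(url);
    const parsed = await parse(bytes);
    return { status: 200, json: { text: parsed.text } };
  } catch (error) {
    // Download and parse failures are both reported as a generic 500 here.
    return { status: 500, json: { error: "An error occurred while parsing the PDF." } };
  }
}
```

The Convex action would then send just the file's storage URL in the request body instead of base64 chunks.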

Siraj•2y ago

Hello!! Do we have any updates on getting PDF parser working? I'm getting same error. 😦

OmarOP•2y ago

@Siraj You mean you are getting the Body exceeded 1mb limit error code too? I put this into our backlog, so I can't offer any further help.

Siraj•2y ago

No, I'm getting this error. Even when I use forked pdf-parse from above.

Michal Srb•2y ago

Hey @Siraj did you try based on my suggestion above to vendor in the pdf-parse code and import the mozilla dependency directly? (not using the dynamic require())

Siraj•2y ago

Hello! Thanks for the information. Vendoring in the pdf-parse code and importing the Mozilla dependency directly didn't work. I ran into trouble because pdf.js doesn't have an ES export, or maybe I did something wrong, but I finally sorted it out by using @bundled-es-modules/pdfjs-dist in my React app, extracting the text in the frontend and sending the text data to the backend.

Siraj•2y ago

For anyone who has a similar issue in the future, this should work 🙂 https://www.npmjs.com/package/@bundled-es-modules/pdfjs-dist


jamwt•2y ago

thanks for sharing a solution, @Siraj !

presley•2y ago

UPDATE: We have released a new beta feature in Convex 1.4 (https://news.convex.dev/announcing-convex-1-4/) that will allow us to skip bundling and install the packages on the server. We believe this will likely get pdf-parse to work out of the box. More details on how to use the feature: https://docs.convex.dev/functions/bundling?ref=news.convex.dev#external-packages. One detail is that you might need to use require("pdf-parse") instead of import "pdf-parse", since the package has an issue that makes it think it is running in test mode if you use import.

