Rob
Rob8mo ago

Crawlee in Convex Action

Hi, I'm trying to run Crawlee (an open source web scraping library) in a Convex action. I was able to get my action to deploy by using the Node runtime and specifying crawlee and playwright as external packages:
{
  "node": {
    "externalPackages": ["crawlee", "playwright"]
  }
}
However, when I run my action, I'm seeing this error:
log
'\x1B[32mINFO\x1B[39m \x1B[33m PlaywrightCrawler:\x1B[39m Starting the crawler.'
error
'\x1B[31mERROR\x1B[39m Memory snapshot failed.\n' +
' spawn ps ENOENT\n' +
' \x1B[90m at ChildProcess._handle.onexit (node:internal/child_process:284:19)\x1B[39m\n' +
' \x1B[90m at onErrorNT (node:internal/child_process:477:16)\x1B[39m\n' +
' \x1B[90m at processTicksAndRejections (node:internal/process/task_queues:82:21)\x1B[39m'
failure
[Request ID: aa36c2a408c7a2f7] Server Error
Uncaught Error: spawn ps ENOENT
Any thoughts or advice on how to troubleshoot?
4 Replies
Rob
RobOP8mo ago
Here is my action code:
"use node";
// For more information, see https://crawlee.dev/
import { Configuration, PlaywrightCrawler, ProxyConfiguration } from "crawlee";
import { router } from "./routes.js";
import { action } from "./_generated/server.js";

export const runCrawlee = action({
  args: {},
  handler: async () => {
    const startUrls = ["https://crawlee.dev/"];

    const crawler = new PlaywrightCrawler(
      {
        // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
        requestHandler: router,
        // Comment this option to scrape the full website.
        maxRequestsPerCrawl: 20,
      },
      new Configuration({ persistStorage: false })
    );

    await crawler.run(startUrls);

    return await crawler.getData();
  },
});
"use node";
import { createPlaywrightRouter } from "crawlee";

export const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ enqueueLinks, log }) => {
  log.info(`enqueueing new URLs`);
  await enqueueLinks({
    globs: ["https://crawlee.dev/**"],
    label: "detail",
  });
});

router.addHandler("detail", async ({ request, page, log, pushData }) => {
  const title = await page.title();
  log.info(`${title}`, { url: request.loadedUrl });

  await pushData({
    url: request.loadedUrl,
    title,
  });
});
This code is essentially Crawlee's Playwright + TypeScript template, adapted to run in a Convex action, with some slight adjustments based on the documentation for deploying on AWS Lambda: https://crawlee.dev/docs/deployment/aws-cheerio
jamwt
jamwt8mo ago
Unfortunately, Playwright is known to be unsupported in the Convex action environment(s). Here's a more detailed thread: https://discord.com/channels/1019350475847499849/1196023188421881908/1196023188421881908
Rob
RobOP8mo ago
Ah, I see. Thanks. Any plans to support it in the future?
jamwt
jamwt8mo ago
We do plan to support more sophisticated "build steps" for action environments, and that will probably help. Ultimately, there are probably system libraries missing here that something like Chromium needs in order to function.
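For anyone troubleshooting a similar `spawn ... ENOENT` failure: in Node, that error code means the executable being spawned could not be found, which fits the point above about pieces missing from the action environment. A minimal diagnostic sketch follows, assuming a Node runtime; `isMissingBinary` is a hypothetical helper written for illustration, not part of Crawlee or Convex, and it only checks whether a command can be spawned at all.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical helper (not a Crawlee or Convex API): returns true when
// spawning `command` fails with ENOENT, i.e. the executable was not found.
export function isMissingBinary(command: string): boolean {
  // The "--version" argument is arbitrary; only the spawn error matters here,
  // not the exit code or the output of the command.
  const result = spawnSync(command, ["--version"], { stdio: "ignore" });
  const err = result.error as NodeJS.ErrnoException | undefined;
  return err?.code === "ENOENT";
}

// Report on `ps`, the command named in the error from the original post.
for (const tool of ["ps"]) {
  console.log(`${tool}: ${isMissingBinary(tool) ? "missing (ENOENT)" : "spawnable"}`);
}
```

To try this inside Convex, the same check would need to be called from a `"use node"` action handler. It would only confirm whether `ps` is absent; it would not make Playwright work there.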
