David Alonso • 9mo ago

Inferring data types from sample data

I was wondering if Convex has open sourced the part of their product that infers types based on sample data retrieved for each field. Would be useful for an app I'm building!
24 Replies
David Alonso (OP) • 9mo ago
we're essentially gonna query firestore databases which are schemaless and get back a json object for several documents and have to go from there
lee • 9mo ago
yes that part is open source too, although it's not exactly separated out so it could be tricky to extract just that functionality. For example, you can npx convex import --table data data.jsonl and then npx convex export, and the inferred schema will be available in data/generated_schema.jsonl
David Alonso (OP) • 9mo ago
Thanks @lee ! that sounds like the perfect example to start with! I haven't had the chance to look at your open source repos so I'm not even sure where to start looking for a source file that handles some of this logic. If the schema is being generated in the export command, how can I find the code being run when I execute that command?
jamwt • 9mo ago
GitHub
convex-backend/crates/shape_inference/src/lib.rs at main · get-conv...
Open source single-machine version of the Convex backend - get-convex/convex-backend
jamwt • 9mo ago
and our co-founder @sujayakar did most of the theoretical work here (and initial implementation)
David Alonso (OP) • 9mo ago
amazing! not very familiar with rust but I assume this would also work in TS?
jamwt • 9mo ago
that exact codebase? or you mean, a similar inference algorithm?
David Alonso (OP) • 9mo ago
a similar inference algorithm, either an existing library or porting yours
jamwt • 9mo ago
you could port it, or possibly even just compile it to wasm
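For readers wondering what "a similar inference algorithm" might look like at its simplest, here is a minimal TypeScript sketch. It is not Convex's shape inference and makes no attempt to match it; it only records, per top-level field, which JSON types were observed and whether the field was missing from any sampled document. The names `inferFieldTypes`, `FieldInfo`, and `typeName` are made up for this illustration.

```typescript
// A naive per-field type inference over sampled JSON documents.
// NOT Convex's algorithm: just an illustration of the general idea.

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Classify a single JSON value into a coarse type name.
function typeName(value: Json): string {
  if (value === null) return "null";
  if (Array.isArray(value)) return "array";
  return typeof value; // "boolean", "number", "string", or "object"
}

interface FieldInfo {
  types: string[]; // sorted, distinct type names observed for this field
  optional: boolean; // true if at least one document lacks the field
}

function inferFieldTypes(
  docs: Array<{ [key: string]: Json }>
): Record<string, FieldInfo> {
  const seen = new Map<string, { types: Set<string>; count: number }>();
  for (const doc of docs) {
    for (const field of Object.keys(doc)) {
      const entry = seen.get(field) ?? { types: new Set<string>(), count: 0 };
      entry.types.add(typeName(doc[field]));
      entry.count += 1;
      seen.set(field, entry);
    }
  }
  const result: Record<string, FieldInfo> = {};
  seen.forEach((entry, field) => {
    result[field] = {
      types: Array.from(entry.types).sort(),
      optional: entry.count < docs.length,
    };
  });
  return result;
}
```

A sketch like this does not recurse into nested objects or arrays, which is where most of the real complexity would be.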
David Alonso (OP) • 9mo ago
Just trying to find the easiest way to integrate a modified version of your code into our app, so sorry for all these questions.. is there a way i can directly use your rust library within convex (inside functions for instance?). We're grabbing data from Firestore (also no-sql) and trying to figure out the schema of our users
sujayakar • 9mo ago
hey @David Alonso ! very cool that the shape inference algorithm would be useful for you. can you tell me a little bit about your app? do you want this process of grabbing data from firestore and inferring the schema to be entirely automated? or is it more of a manual process?
David Alonso (OP) • 9mo ago
It should be automated, we want to do it during onboarding and at the user's request. We store this schema in convex so we can then let the user easily use filters in our app
jamwt • 9mo ago
this is a complex enough codebase that the WASM route may not be a crazy idea...
David Alonso (OP) • 9mo ago
okay, for context, i have 0 experience with rust and wasm and how they'd interoperate with our TS backend
sujayakar • 9mo ago
yeah, I think it'd be a pretty big lift to (1) isolate the shapes code for separate use as a library and (2) get it working w/wasm. here's an idea -- let me know if I'm understanding the flow correctly @David Alonso.

1. the user initiates an import from firestore.
2. a convex action downloads the data from firestore and imports it into convex into an empty table that has v.any() as its schema.
3. after finishing uploading all the data to the table, the action gets the current inferred schema for the table. this could either be via calling the CLI somehow (i.e. npx convex export) or potentially an undocumented API that the dashboard uses. we won't guarantee that this API is stable so it's at your own risk, but it'll work 🙂
4. we clear the table to clean up for the next import.

let me know if that seems promising and we can flesh out the details!
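The four steps above can be sketched in TypeScript as an orchestration function with its dependencies injected. This is only an outline under stated assumptions: `ImportDeps` and every function on it are hypothetical placeholders, not Convex or Firestore APIs, and in a real app they would be backed by a Firestore client, Convex mutations, and whichever schema lookup is chosen.

```typescript
// Hypothetical dependencies: none of these names are Convex or Firestore APIs.
interface ImportDeps {
  downloadDocs: () => Promise<unknown[]>; // step 2: fetch documents from Firestore
  insertDocs: (docs: unknown[]) => Promise<void>; // step 2: write into the scratch table
  fetchShape: () => Promise<unknown>; // step 3: read the table's inferred schema
  clearTable: () => Promise<void>; // step 4: delete the scratch rows
}

async function inferSchemaViaScratchTable(deps: ImportDeps): Promise<unknown> {
  const docs = await deps.downloadDocs();
  try {
    await deps.insertDocs(docs);
    return await deps.fetchShape();
  } finally {
    // Runs whether the steps above succeeded or threw, so imported rows
    // are not left behind when the insert or the shape lookup fails.
    await deps.clearTable();
  }
}
```

The `finally` block covers thrown errors only. It does not run if the action itself is killed or times out, so a separate cleanup mechanism would still be needed for that case.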
David Alonso (OP) • 9mo ago
Hey @sujayakar I so appreciate your thoughts on this! This flow sounds pretty clear to me! We're on the prototype stage so we'd be happy to use an undocumented API, lmk how I can test it out

Two main questions/concerns:

1. How well the unmodified library/API/CLI command will work for data coming from firestore which includes some of their proprietary types like refs, timestamps, etc. I'm pretty sure with some post processing we should be able to get it to work in the action, will test this today hopefully.
2. The privacy of our user's data is of utmost importance, so I'm wondering if there's any guarantees we can make about the user's data living on Convex strictly for the duration of execution of the action.

Let me know your thoughts especially on 2 🙂 and the API

Also curious about this
[Screenshot attached; no description available]
sujayakar • 9mo ago
both great questions!

for (1), you'll need to convert the data types on https://firebase.google.com/docs/firestore/manage-data/data-types to corresponding data types in convex (https://docs.convex.dev/database/types). so, for example, you could map dates to strings, geographical points to {lat: v.number(), long: v.number()} objects, maps to objects, references to strings, and vectors to v.array(v.number()).

for (2), I would recommend writing down in a table that you're starting an import job. then, in case the import job crashes and leaves the table hanging around, you can have a cron that periodically looks at jobs, times out ones that look like they're stuck, and then deletes their rows.

I have to run to a meeting right now, but I'll give you some pointers on the API the dashboard uses shortly.

okay, so for querying the tables' inferred schema (i.e. its "shape"), the dashboard uses an internal /api/shapes2 endpoint. the implementation of it is open source: https://github.com/get-convex/convex/blob/ce25fa7bd011014efcfff617c3c7fcab882d4ac9/crates/local_backend/src/dashboard.rs#L45

you can call this from an action with just fetch and a deploy key. I'd recommend creating a deploy key for your production deployment (https://docs.convex.dev/dashboard/deployments/deployment-settings#url-and-deploy-key) and then setting that secret in an environment variable (e.g. CONVEX_DEPLOY_KEY in https://docs.convex.dev/dashboard/deployments/deployment-settings#environment-variables). note that this will only work with a production or preview deployment. if you'd like this to work with a development deployment, let me know.

then, you can call this endpoint with something like...
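// Assumes this file sits directly in the convex/ directory.
import { action } from "./_generated/server";

// Fetch every table's inferred shape from the internal, unstable /api/shapes2 endpoint.
export const loadShapes = action(async (ctx) => {
  const resp = await fetch(process.env.CONVEX_CLOUD_URL + "/api/shapes2", {
    headers: {
      // Deploy key stored as an environment variable on the deployment.
      Authorization: "Convex " + process.env.CONVEX_DEPLOY_KEY,
    },
  });
  return await resp.json();
});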
the response has a field per table with its shape as the value. note again that this endpoint and its data formats have no stability guarantees, so let us know before you put an app that uses it into production. but hopefully this unblocks you!
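The type mapping suggested in (1) above could be sketched roughly as follows. To keep the example self-contained it does not use the Firestore SDK's own classes. Instead it assumes the Firestore values have already been normalized into a tagged form; the `FirestoreLike` type and its `kind` tags are hypothetical, as are the names `toConvexValue` and `ConvexValue`. Timestamps become ISO 8601 strings, which is one of several reasonable choices.

```typescript
// Hypothetical tagged representation of Firestore values (not SDK types).
type FirestoreLike =
  | null
  | boolean
  | number
  | string
  | { kind: "timestamp"; millis: number }
  | { kind: "geopoint"; latitude: number; longitude: number }
  | { kind: "reference"; path: string }
  | { kind: "vector"; values: number[] }
  | { kind: "array"; items: FirestoreLike[] }
  | { kind: "map"; fields: { [key: string]: FirestoreLike } };

// Plain JSON-like subset of values that Convex can store.
type ConvexValue =
  | null
  | boolean
  | number
  | string
  | ConvexValue[]
  | { [key: string]: ConvexValue };

function toConvexValue(value: FirestoreLike): ConvexValue {
  // Primitives and null pass through unchanged.
  if (value === null || typeof value !== "object") return value;
  switch (value.kind) {
    case "timestamp":
      return new Date(value.millis).toISOString(); // date -> string
    case "geopoint":
      return { lat: value.latitude, long: value.longitude }; // point -> object
    case "reference":
      return value.path; // reference -> string
    case "vector":
      return value.values.slice(); // vector -> array of numbers
    case "array":
      return value.items.map(toConvexValue);
    case "map": {
      const out: { [key: string]: ConvexValue } = {};
      for (const key of Object.keys(value.fields)) {
        out[key] = toConvexValue(value.fields[key]);
      }
      return out; // map -> object
    }
  }
}
```

Note that this conversion is lossy: once a reference or timestamp has become a string, the inferred shape can no longer tell it apart from an ordinary string field.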
David Alonso (OP) • 9mo ago
Running behind on a few things so haven't been able to test this yet, but will soon! Testing in prod should be fine for now but I'll let you know when it becomes infeasible. For (1) I actually was planning to dump all the data from firestore directly into Convex first, extract the schema, and THEN do post processing to make sure there's no information loss. For instance if I convert refs to strings then it might be harder to identify that field as a ref to a particular document vs just a string in Firestore. Will report how this goes when I test. Any thoughts on the screenshot I shared? Not sure how I can programmatically create tables in convex. My guess is that each collection/subcollection in every user's firestore db would need a separate temporary table
lee • 9mo ago
You can create a table by writing to it, with ctx.db.insert (or npx convex import --table if you're already using the CLI).
David Alonso (OP) • 9mo ago
if I have a schema defined I can't just pass any table name to the insert function right?
[Image attached; no description available]
jamwt • 9mo ago
Schemas | Convex Developer Hub
Schema validation keeps your Convex data neat and tidy. It also gives you end-to-end TypeScript type safety!
jamwt • 9mo ago
You can use this to let unspecified tables be untyped
David Alonso (OP) • 9mo ago
oh awesome! that part is sorted then ✅
