zid · 2y ago

Normalization and SQL-like strategies

second/last question for the evening -- as someone coming from Firebase with no SQL experience, I'm trying to determine the best way to approach typed schema design for Convex. Should I approach Convex solely with typed schemas in mind, which I assume would let me safely and efficiently model my data with Firebase paradigms (collections, docs, object-oriented)? Or should I adopt certain principles and processes like normalization, as you see in SQL? Thanks in advance!
EDIT: Okay, just read:
• https://stack.convex.dev/convex-vs-relational-databases
• https://stack.convex.dev/convex-vs-firebase
I'll just say... things that seem too good to be true make me nervous.. lol. Would still love to hear thoughts/insights on this
16 Replies
Michal Srb · 2y ago
Hey @zid, other folks might jump in on this topic since it's a broad question. My quick take: if you can (i.e. you're not limited by performance or some limitation in Convex, which we'd love to hear about), use Convex like a relational database, with normalized data modeling, relationships between tables, and no duplicated state. At minimum you get the programming paradigm this gives you, where you don't need to worry about updating duplicated state.
ballingt · 2y ago
@zid +1, schemas and normalization via relations go great together. I start by making tables with the entities I want to return to the client, then split data into multiple tables to avoid duplicating information. Convex mutations are transactional (changes to the database are all or nothing), so denormalizing isn't hard. But it's a lot of mental/code overhead to update things in multiple places; normalized data is nicer to work with.
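A minimal sketch of that starting point, assuming a hypothetical chat app (the table and field names below are made up for illustration, not from this thread):

```typescript
// convex/schema.ts
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  // One table per entity the client needs.
  users: defineTable({
    name: v.string(),
  }),
  messages: defineTable({
    body: v.string(),
    // Reference the author by ID instead of copying their name into
    // every message: rename a user once, in one place.
    authorId: v.id("users"),
  }),
});
```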
zid (OP) · 2y ago
@Tom @Michal Srb Appreciate the response and input. Can I instead model my data the NoSQL way (denormalized)? My impression from reading the two articles was that there's a strong emphasis on not needing to pre-determine table architecture, etc. In my case, I pretty much have my app code built with Firebase as the backend, so I'm fetching with what the page/user sees in mind. In any case, it would be helpful to know the exact, or at least strongly recommended, paradigm to follow, and the x, y, z reasons for it, such as scalability and performance. If I'm going to adopt the mental model of SQL, then I want to be sure that's the correct way to go. But if going the SQL way is more or less a preference, and there are truly no significant tradeoffs across the entire spectrum of SQL and NoSQL architectures, then I'd want to go with NoSQL because, as Convex markets, I want to focus on building the product (as much as I'd like to geek out on everything else). That said, in about 10 minutes I'll actually be diving into the docs (just skimmed thus far, plus the articles). Again, thank you for your input, and I'd love to hear your thoughts once more.
ballingt · 2y ago
Denormalized data is fine too. There's a size limit of 1MB/record that could push you to split things up and combine them in your query functions if your data is really big. Denormalized data is great if you already have an app, since you won't be changing your data access patterns; changing access patterns is where normalized data is nice. An advantage of Convex is that if for some reason you want to normalize later while maintaining the same API, you can.
zid (OP) · 2y ago
Amazing, thank you Tom. Is the 1MB/record the only limit or edge case I should be concerned with? For instance, with Firebase, an efficient schema could look like this: there is a users collection/table, and each user would have a number of collections/tables dedicated to that user. In other words, each user gets x collections/tables of their own. This is in contrast to there being only 5 global tables with x * 5 documents. I assume the first approach is much better in terms of performance, but still, I just want to make sure I'm not missing any tradeoffs here.
ballingt · 2y ago
Ah, great question. In Convex, if you use an index, it's no more efficient to use 5n tables than to use 5 tables with n rows each. The way queries work, each instance of a query running for each user watches a different range of rows/records. This requires adding an index on e.g. user ID so that each query watches a consecutive range of rows. Good callout though, Convex is opinionated here: you could in theory create millions of tables, one (or 5) per user, but we don't want you to do that. Our dashboard tools and TypeScript bindings work best with a fixed number of tables, not 5n tables.
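As a sketch of that pattern, assume a single shared messages table whose schema declares an index with .index("by_user", ["userId"]) (all names here are hypothetical, not from this thread):

```typescript
// convex/messages.ts
import { query } from "./_generated/server";
import { v } from "convex/values";

export const forUser = query({
  args: { userId: v.id("users") },
  handler: async (ctx, { userId }) => {
    // Because of the index, this subscription watches only the consecutive
    // range of rows belonging to this user; changes to other users' rows
    // won't re-run the query.
    return await ctx.db
      .query("messages")
      .withIndex("by_user", (q) => q.eq("userId", userId))
      .collect();
  },
});
```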
ballingt · 2y ago
@zid a teammate just pointed me to our limits page https://docs.convex.dev/production/state#limits. We limit the number of database tables to 10,000, so creating a table per user wouldn't work for many apps. I would hope most projects have fewer than 100 tables; you shouldn't need to shard data across multiple tables yourself, since we take care of scaling.
zid (OP) · 2y ago
Hey Tom, thanks a lot for the explanation. I think I understand most of it... A few clarifying questions: The phrase "different ranges of rows/records are watched by each instance of a query running for each user" implies that each user's query will only monitor or interact with their specific data, and this is partly how your indexes work? Are your "indexes" synonymous with SQL indexes? To be more confident in constructing the data models, would diving into SQL principles be enough, or maybe even specific technologies like PostgreSQL? What would you say is the best foundation of knowledge before moving confidently into prod? If 5 tables with n rows is the way to go, then I assume there are no limits to the size of a table? Can a table hold trillions of rows, or do scaling mechanisms happen in the background so I don't have to worry about this at all? Sorry for all the SQL-ish questions, but hoping this will help others as well.
ballingt · 2y ago
Great questions, and inspiration for future documentation.

Yeah, each query monitors/watches/subscribes to changes in its "read set": the set of ranges of rows of each table that it needs to watch to know if its result might change. If you db.query("rooms") in a query function, changes (including insertions/deletions) to any row of the rooms table will cause the query to re-execute. If a query reads just a single row of a table, only that row needs to be watched! But if a query reads e.g. every row with an owner column matching a specific user ID, the range watched by that query won't be very specific unless an index is used.

"Indexes" mean the same thing they mean in SQL, although "synonymous" makes it sound like we're using a specific database's index feature for this, which we're not.

Billions of rows in a table, absolutely. The free plan tops out at 1M rows, but these can all be in the same table, no problem. Trillions? That's a lot: at 1MB/row that would be an exabyte, which is "talk to us first" territory. But yeah, tables can have a lot of rows.

For constructing data models, most SQL design principles probably translate. PostgreSQL specifics, not so much; we do our own indexing and don't expose underlying database primitives.

As for how to gain confidence to deploy, I'd love to talk on a call to figure this out. Until you're running a realistic load it's hard to say anything for sure, and we'd love to help you get that set up.
zid (OP) · 2y ago
Thank you, again. I think I understand the basics of how/why indexes are used. It helps to know that I can rely on SQL indexes as a basic frame of reference. This is all new to me, so this is quite helpful. I am curious about the differences between traditional SQL indexes and Convex's own index architecture. From what little I know, there seems to be a tradeoff when using indexes (i.e. a tradeoff between read perf vs. write perf + storage), so there's a slight concern and curiosity there, but from our conversation, I'm almost at the point where I just need to start tinkering. With respect to data modeling, I think the first thing I might do after reviewing the docs is write the schema for my entire app. Given Convex's flexibility + opinions, I'll be looking for the sweet spot between my current Firebase schema and a more denormalized schema. Regarding the call, yes! I would love that and appreciate your efforts; I'm just hoping I can add enough value to the call. Feel free to let me know how I can better prepare for that. My current thought around Convex is to potentially use it as my main database, replacing Firebase, while keeping my Firebase + Xata/Mongo setup as a backup/prevention measure. I'm still a few months away from the app being completed, but I think I'll take this week to dive into Convex and see where that leads by end of week. First things first, I'll finish reviewing the docs and then prepare the initial draft of the schema. I will PM you for contact/meeting details.
ballingt · 2y ago
Sounds great! To be clear, Convex can be used with a schema similar to Firebase's, except for the "one table per user" pattern you mentioned. The best way to prep would be to use Convex a bit, e.g. taking your most perf-sensitive queries or flows and translating them to Convex with synthetic data. Happy to chat about questions that come up.
zid (OP) · 2y ago
Mm, thank you for the clarification, that's great. Yes, let me play around with Convex for a bit and I'll be sure to pick your brain some more. Thank you Tom!
jamwt · 2y ago
From what little I know, there seems to be a tradeoff when using indexes (i.e. a tradeoff between read perf vs. write perf + storage)
Yep, that's right. And it's pretty safe in general to imagine indexes working much the same the world over, in anything calling itself a database: SQL indexes, Convex indexes, Mongo indexes, etc.

On the read side, an index gives you quick access to a record offset for a point query or a range, dramatically speeding up reads.

Every index on a table, however, is an additional "hidden" atomic write whenever you change a record in that table, since the associated indexes will also be updated (potentially: only if any indexed fields were modified). Those indexes are effectively secondary tables, so yes, they increase your total storage as well.

But in general, indexes are essential, and eating the write/storage overhead is worth it, because table scanning becomes a nonstarter after only a few thousand rows. It will grind your project to a halt to find records that way.
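A toy model of that tradeoff in plain TypeScript (not Convex code, just an illustration): a secondary index on an `owner` field turns reads into point lookups, but every insert now performs a second, "hidden" write to keep the index in sync, and the index itself is extra storage.

```typescript
type Row = { id: number; owner: string; body: string };

// Primary storage keyed by id, plus a secondary index on `owner`.
const table = new Map<number, Row>();
const byOwner = new Map<string, Set<number>>();

function insert(row: Row): void {
  table.set(row.id, row); // the write you asked for...
  let ids = byOwner.get(row.owner);
  if (!ids) byOwner.set(row.owner, (ids = new Set()));
  ids.add(row.id); // ...plus the "hidden" index write
}

// Indexed read: jump straight to the matching ids, no full scan.
function rowsForOwner(owner: string): Row[] {
  const ids = byOwner.get(owner);
  if (!ids) return [];
  return Array.from(ids, (id) => table.get(id)!);
}

// Unindexed read, for contrast: touches every row in the table.
function rowsForOwnerScan(owner: string): Row[] {
  return Array.from(table.values()).filter((r) => r.owner === owner);
}

insert({ id: 1, owner: "zid", body: "hello" });
insert({ id: 2, owner: "tom", body: "hi" });
insert({ id: 3, owner: "zid", body: "again" });
```

Both read paths return the same rows; the difference is that the scan's cost grows with the whole table, while the indexed read's cost grows only with the matching rows.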
zid (OP) · 2y ago
Thank you Jamie, that was very helpful. Makes sense not to be concerned about the storage overhead, as this tradeoff is a simple one to reason through and quite necessary. Follow-up question regarding the Convex way of doing things for something like assigning usernames, in part related to indexes as well. In my Firebase schema, I have 3 collections storing different parts of user data (3 levels of secure information), and a few fields are duplicated as well, one of them being the username field. I mention this because thus far, the only opinion I've seen Convex hold with respect to its architecture is its relational-database-like table structure, and I think the following may further clarify, at least for myself, where the line is actually drawn between SQL, NoSQL, etc. With Convex, it seems there are at least 2 ways I can set up usernames, in the context of assigning a username to each user:
1. Creating a unique username field on one of the user tables, say userProfiles, plus an index on the username field.
2. Creating a usernames table with the same unique username field.
In Firebase, I would use the username as the docId, but given that Convex IDs are global, we wouldn't (and shouldn't) try to touch those, correct? (I don't even think this is possible.) In my mind, the only con of the first approach is that the userProfiles table already has a userId index, so I'm not sure at what point these additional indexes/atomic writes would cause a significant enough issue to warrant the second approach. The second approach seems fine too? Perhaps even better if I wanted to maintain the duplicated username field on multiple tables?
Michal Srb · 2y ago
For the 1st approach, whether you need an index on the username field depends on whether you need to look up documents in the table by username (which is usually the case, e.g. when a user is signing in and you want to look up and compare their salted password). For the 2nd approach, you are correct that Convex already provides globally unique IDs, so another identifier cannot be used as the canonical ID. But as long as you have an index in place, like in the first approach above, you can use another column for the same set of operations you'd use an ID for (by first looking up the document via this column, then using its ID). It sounds like the first approach will work fine. There is a question of at what point you should split data into multiple documents. It sounds like your Firebase implementation already splits information across multiple documents, and this would work well in Convex too (linking them via foreign keys, which can be IDs or indexed usernames).
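A sketch of the first approach, assuming a hypothetical userProfiles table whose schema declares .index("by_username", ["username"]) (the table, index, and function names here are illustrative, not from this thread):

```typescript
// convex/usernames.ts
import { mutation } from "./_generated/server";
import { v } from "convex/values";

export const setUsername = mutation({
  args: { profileId: v.id("userProfiles"), username: v.string() },
  handler: async (ctx, { profileId, username }) => {
    // The index makes this a point lookup rather than a table scan.
    const existing = await ctx.db
      .query("userProfiles")
      .withIndex("by_username", (q) => q.eq("username", username))
      .unique();
    if (existing && existing._id !== profileId) {
      throw new Error("Username already taken");
    }
    await ctx.db.patch(profileId, { username });
  },
});
```

Because mutations run as transactions, the check-then-write above is race-free: two users claiming the same username can't both succeed.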
zid (OP) · 2y ago
Gotcha, thank you Michal. The transition from Firebase to Convex is really smooth. Just a bit anxious given my own deadlines but thus far, the devx has been quite stellar and straightforward. Kudos to the Convex team as well, and Happy 4th!
