zid · 15mo ago

Question regarding indexes

I think I'm a bit confused about the underpinnings of how some of this works. If I'm fetching documents that need to be sorted, I'm wondering whether placing an index on that field allows a more efficient sorting algorithm to be used. Moreover, I'm assuming that since the docs will essentially be sorted ahead of time, this will reduce table scans? In practice, I'm trying to figure out the following (note: I understand that Convex has the default creation time property, and thus fields like postDate may be unnecessary, but this is for example's sake). Should I use an index in this situation?
// schema
.index("by_points", ["points"])

// app code
threads = await db
  .query("threads")
  .withIndex("by_points") // does this help?
  .order("desc", "points")
  .paginate(paginationOpts);
Is this a proper application of an index? There's no eq, but rather a gte, followed by the second half of the composite index that's not explicitly applied:
// schema
.index("by_postDate_points", ["postDate", "points"])

// app code
threads = await db
  .query("threads")
  .withIndex("by_postDate_points", (q) =>
    // I'm assuming this is in fact an index range
    q.gte("postDate", timeFrameStart)
  )
  .order("desc", "points")
  .paginate(paginationOpts);
I'm trying to see whether the appropriate index should be .index("by_districtsId", ["districtsId"]) or .index("by_districtsId_postDate", ["districtsId", "postDate"]):
const threads = await db
  .query("threads")
  .withIndex("by_districtsId_postDate", (q) =>
    q
      .eq("districtsId", districtsId)
      .gte("postDate", timeFrameStart)
  )
  .order("desc", "points")
  .paginate(paginationOpts);
zid (OP) · 15mo ago
Yes I have; reviewing it a second time right now... still not sure of a few things. For instance, below you'll see I'm fetching using a composite index. In the range, I'm not specifying anything with regard to points, but I'm assuming that it still sorts by points (postDate first)?
threads = await db
  .query("threads")
  .withIndex("by_postDate_points", (q) =>
    // I'm assuming this is in fact an index range
    q.gte("postDate", timeFrameStart)
  )
  .order("desc")
  .paginate(paginationOpts);
Additionally, the below seems like it's valid, but I want to make sure. From my understanding, there is no way to tell from my end whether an index is being applied successfully or being ignored. Is this a proper application of an index? There's no eq, but rather a gte, followed by the second half of the composite index that's not explicitly applied:
// schema
.index("by_postDate_points", ["postDate", "points"])

// app code
threads = await db
  .query("threads")
  .withIndex("by_postDate_points", (q) =>
    // I'm assuming this is in fact an index range
    q.gte("postDate", timeFrameStart)
  )
  .order("desc", "points")
  .paginate(paginationOpts);
lee · 15mo ago
> but im assuming that it still sorts by points (postDate first)?
Yes, it sorts by postDate first, then points, even if points is not in the filter.
> From my understanding there is no way to tell from my end whether an index is being applied successfully or being ignored.
If you use withIndex, the index is always applied; Convex never ignores indexes (ref https://stack.convex.dev/not-sql ).
> Is this proper application of an index?
Yes, this works, although the second argument to order is ignored. If the index is on (postDate, points), then all results will be ordered by postDate first, then by points where postDate is equal.
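To illustrate with made-up values: with an index on (postDate, points), documents live in this key order, and .order("desc") simply walks it backwards:

{ postDate: 100, points: 5 }
{ postDate: 100, points: 9 } // points only breaks ties within equal postDate
{ postDate: 200, points: 1 } // a later postDate sorts after, regardless of points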
zid (OP) · 13mo ago
Ah, thank you very much! Responding to "yes it sorts by postDate first, then points, even if points is not in the filter": I can confirm that postDate is applied correctly, but the sorting seems to stop there. From the 10 results, it seems to be sorted by recency and ignores points... is there a good way to debug this? Here are a few things I've tried to understand the behavior:
- when using a different index, one that uses points and does not use the .gte("postDate", timeFrameStart), it sorts by points as expected
- when removing points and only filtering by postDate, whether that's year, day, all time, etc., it filters as expected
lee · 13mo ago
Can you share some document values for postDate, points, and _creationTime, and the index query you're doing? The query will only be sorted by points for documents that have exactly equal postDate fields.
zid (OP) · 13mo ago
One moment. Goal: allow users to fetch items from a timeframe (allTime), sorted by a single field value, points. Index: .index("by_districtsId_status_postDate_points", ["districtsId", "status", "postDate", "points"]). Query:
const threads = await db
  .query("threads")
  .withIndex(`by_districtsId_status_postDate_${filter}`, (q) =>
    q
      .eq("districtsId", districtsId)
      .eq("status", "visible")
      .gte("postDate", timeFrameTimestamp)
  )
  .order("desc")
  .paginate(paginationOpts);
Returned list of items / document values:
doc1: points: 0, postDate: 1707257107319
doc2: points: 0, postDate: 1707257082693
doc3: points: 0, postDate: 1705711005160
doc4: points: 900, postDate: 1705703265864
Not sure I understand "The query will only be sorted by points for documents that have exactly equal postDate fields"... ohh, never mind. So, essentially, this won't work... I feel like my use-case is relatively common?
lee · 13mo ago
Yeah, this is sorted by postDate.
zid (OP) · 13mo ago
What's a workaround/solution? I can't sort by points first, then postDate, given that for postDate what I want is to utilize the .gte. The only workaround I've built was to create my own sorting algorithm, but that can only be applied to a single page at a time (paginated queries).
lee · 13mo ago
There isn't a great solution, since however you slice it you're asking each query to read a lot of data; there's no way to pre-sort the data so all of the documents you want are next to each other. There are some workarounds if you're able to pre-process the data: add a boolean isRecent to each document, defined as isRecent: postDate > timeFrameTimestamp, and then index query for q.eq("isRecent", true) with .order("desc") and an index on ["isRecent", "points"]. Or, similarly, you could make a postDay: Math.floor(postDate / MILLISECONDS_PER_DAY) field.
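For illustration, a minimal sketch of the isRecent workaround (the index name is assumed, and something, e.g. a cron, would have to keep the flag current as documents age out):

// schema (index name assumed)
.index("by_isRecent_points", ["isRecent", "points"])

// app code: within isRecent === true, rows are already ordered by points,
// so descending order yields the highest-point recent threads first
const threads = await db
  .query("threads")
  .withIndex("by_isRecent_points", (q) => q.eq("isRecent", true))
  .order("desc")
  .paginate(paginationOpts);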
zid (OP) · 13mo ago
Appreciate the help, btw. I'm at least glad that I get what's happening... hmm.
lee · 13mo ago
What you really want is a way to sort your data by some field so that the documents you want are contiguous.
zid (OP) · 13mo ago
Hmm, dangit, I wonder how Reddit does this. Okay, just to be clear on a few things: when you say "however you slice it you're asking each query to read a lot of data", you're not referencing the use of gte on the postDate field, correct? It seems like an index should work fine there. But just in case, you're referring to the non-working theoretical compound query we're discussing?
In case I forget: thank you for your help, much appreciated. I was stuck on this for a couple hours today lol. Thought I had it working a while ago, but guess not!
Hmm... given this predicament, I'm concerned that my filtering and sorting features will be rather limited. I want the user to be able to fetch the most engaging posts from the past day/week/month/year, but anything this complicated seems out of the question, yes? I mean, I have this logic built out in my app, but again it only applies to whatever the server returns first. Let me look more into the Math.floor(postDate / MILLISECONDS_PER_DAY) approach...
lee · 13mo ago
This does seem like a reasonable use-case, so I'll check with colleagues to see if we have a better recommendation. The Math.floor technique could get you there, but it may require merging pagination pages (query top posts from the last 7 days across 7 queries, then sort the pages together).
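A rough sketch of that merge (the index name, postDay field, and topPerDay are assumptions, not confirmed by the thread):

const MS_PER_DAY = 24 * 60 * 60 * 1000;
const today = Math.floor(Date.now() / MS_PER_DAY);

// one query per day; within a single postDay value, the index
// ["postDay", "points"] already orders rows by points
const pages = await Promise.all(
  Array.from({ length: 7 }, (_, i) =>
    db
      .query("threads")
      .withIndex("by_postDay_points", (q) => q.eq("postDay", today - i))
      .order("desc")
      .take(topPerDay)
  )
);

// merge the per-day results into a single points-descending list
const top = pages.flat().sort((a, b) => b.points - a.points);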
zid (OP) · 13mo ago
Does something like this work? Or am I running into the same issue?
const threads = await ctx.db
  .query("threads")
  .withIndex("by_districtsId_status_postDay_points", (q) =>
    q
      .eq("districtsId", districtsId)
      .eq("status", "visible")
      .gte("postDay", startPostDay)
      .lte("postDay", endPostDay)
  )
  .order("desc", "points")
  .collect();
lee · 13mo ago
Same issue. Remember, order only takes one argument.
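Dropping the ignored second argument, the same query reads as below; the result order doesn't change, since postDay still dominates the sort:

const threads = await ctx.db
  .query("threads")
  .withIndex("by_districtsId_status_postDay_points", (q) =>
    q
      .eq("districtsId", districtsId)
      .eq("status", "visible")
      .gte("postDay", startPostDay)
      .lte("postDay", endPostDay)
  )
  .order("desc") // descending over (districtsId, status, postDay, points)
  .collect();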
zid (OP) · 13mo ago
Got it, and oh yes, thanks. OK, I think I'm starting to understand more of the limitation here, and it seems to be rather fundamental, at least with respect to the design of indexes and how they work. This is an interesting engineering question; curious what, if any, workarounds/suggestions may come from this. Here are some initial thoughts on a workaround. One direction I'm thinking about (a rough sketch follows below):
- add a field to each document, timeFrame, which can be assigned several literal values, 10, 100, 1000, 10000, and null, each representing a timeframe (10 = day, for example)
- use a cron function each evening to process the top 100+ threads with timeFrame set to null (that pass a minimum point threshold) within a certain time period (say a month)
- when the end user selects a timeFrame + points, query with an index like by_timeFrame using gte or lte, and sort by _creationTime/postDate. Although the returned list is not a perfect top-down ordering by points, there's essentially a high-level weighted algorithm happening here that (hopefully) blends in a natural series of results
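A sketch of the query side of this proposal (the index name is assumed; 10 and 100 stand for the day and week buckets; the range deliberately excludes the null bucket):

// schema (index name assumed)
.index("by_timeFrame", ["timeFrame"])

// app code: fetch, e.g., the day + week buckets; within equal timeFrame
// values, the index falls back to _creationTime ordering
const threads = await db
  .query("threads")
  .withIndex("by_timeFrame", (q) => q.gte("timeFrame", 10).lte("timeFrame", 100))
  .order("desc")
  .paginate(paginationOpts);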
jamwt · 13mo ago
Yeah, this is an interesting architectural problem in general, almost like an interview question, because there are tradeoffs. So normally there's thinking around partitioning warm stories vs. cold ones, or whatever. One way to think about it is that only threads with changes need to have their summaries recalculated.
A lot of times, sites at scale have this specialization built into them. A classic example is Twitter: if a user has few followers, you may denormalize the publishing of their tweets; if a user has tons of followers, you don't. So in this case, if you want a reasonable balance, you might have strategies for threads which are updating frequently vs. threads which are "cold", which most threads usually are.
I'd have to think through your proposal here, but the general gist of it (rely on some domain-specific heuristics to achieve what you want) is in spirit what these systems actually do when they're optimizing for scale. Like, loosening the requirement from absolutely right to mostly right could potentially greatly simplify your approach here, and it's often how big sites work in practice: slightly delayed values recalculated periodically, prioritizing recalculations for the most active threads or whatever (and "merging in" popular users on demand, relying on the fact that there are few of them and their records are probably in cache memory because they're fetched so frequently).
Convex's indexes work the way ~all db indexes work, so the considerations are the same on any stack given your requirements. (In fact, Convex's indexes are actually just MySQL indexes.)
zid (OP) · 13mo ago
All of this is super insightful; I'm loving it, thank you! Looking forward to implementing this tomorrow.
Close to a solution I'm happy with, but a bit stuck on something. With this cron job approach, I will need to process these threads in batches. To simplify this scenario, let's assume that this is simply due to the sheer volume of documents along with the runtime limitations. One way I think we can do this is to use a simple shared/common timestamp to check whether a thread has been processed or not for that day. With this approach, one of the drawbacks that comes to mind is that we have to make sure there are enough invocations of this cron job to process all the threads before the day ends. Technically, there is an eventual ceiling to the number of threads that can be processed within a day, which I suppose wouldn't matter significantly if we allow processing to bleed into the next day and we are also loose with our requirements. My question is: what's a better approach to this? I'd love to hear your thoughts in general regarding my approach; if it's horrible, I'd be very pleased to know! And thank you in advance.
Michal Srb · 13mo ago
> we have to make sure there are enough invocations of this cron job in order to process all the threads before the day ends.
This should not be a problem. If it is, you have a very fundamental limitation on your hands, but I imagine you should be able to handle a ton of scale before you run into this. I'd first try it out and see how much you can process in what amount of time.
ian · 13mo ago
You can also process all batches from a single cron execution: it does the first page and uses the scheduler to run itself again with the next page, passing in the pagination cursor. So with a single cron execution you can start a process that works through the whole time period the cron is responsible for. Make sense?
zid (OP) · 13mo ago
Agreed, thank you Michal! Hmmm, that sounds quite clever: essentially a single cron execution to initiate a chain of processing tasks, each scheduling the next until all necessary data has been processed. As long as I save and fetch the current cursor from the db, it seems like it should work well. Also, looking at how errors are handled, it seems like as long as it's not a developer error, Convex will retry indefinitely. I think I'll have to try this, thank you Ian! Finally getting a chance to delve into this... a few follow-up questions, if I may:
- I just came across "A single function can schedule up to 1000 functions with total argument size of 8MB." Is each recursive invocation its own instance/separate function, or does every recursive call accumulate towards this 1000-function limit? If so, no worries, as this is more of an issue that arrives at scale, as Michal noted, but I'd still like to know so I can fine-tune and be aware.
- The following seems okay to me, but do I have the right idea here?
export const processBatch = internalMutation({
  args: { cursor: v.optional(v.string()) },
  handler: async ({ db, scheduler }, { cursor }) => {
    try {
      const batchSize = 10;

      const { page, continueCursor } = await db
        .query("something")
        .paginate({
          numItems: batchSize,
          cursor: cursor ? cursor : null,
        });

      for (let i = 0; i < page.length; i++) {
        // process
      }

      if (page?.length > 0 || continueCursor) {
        await db.patch(someId, { nextCursor: continueCursor });
        await scheduler.runAfter(0, internal.something.processBatch, { cursor: continueCursor });
      }
    } catch (error) {
      throw new Error(error);
    }
  },
});
Michal Srb · 13mo ago
> Is each recursive invocation its own instance/separate function
Yes, so the invocations do not "add up" towards the limit (so more invocations means more data can be processed). Your code looks directionally good to me!
zid (OP) · 13mo ago
That's amazing, thank you Michal! ...I think I caused an infinite loop; had to cancel the invocation manually. The basis of my logic was the following. I'm assuming continueCursor is not a good choice for this? Looking into this further now...
export const processBatch = internalMutation({
  args: { cursor: v.optional(v.string()) },
  handler: async ({ db, scheduler }, { cursor }) => {
    try {
      const batchSize = 10;

      const { page, continueCursor } = await db
        .query("something")
        .paginate({
          numItems: batchSize,
          cursor: cursor ? cursor : null,
        });

      for (let i = 0; i < page.length; i++) {
        // process
      }

      if (!continueCursor) return;

      if (continueCursor) {
        await db.patch(someId, { nextCursor: continueCursor });
        await scheduler.runAfter(0, internal.something.processBatch, { cursor: continueCursor });
      }
    } catch (error) {
      throw new Error(error);
    }
  },
});
Hmm, alright, seems like isDone and page.length are the only reliable ways to stop the recursion. Could you confirm this, please? Also, how does continueCursor continue to have a value? Does it circle back to the first item? Lastly, I am SO GLAD you guys have a manual kill switch.
lee · 13mo ago
isDone (not page.length or continueCursor) is the only way to tell that pagination is done. continueCursor will still have a value at the end, which means "the end"; that's distinct from null, which means "the beginning".
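Folding that back into the snippet above, a minimal corrected sketch (names carried over from the earlier code; someId remains a placeholder):

export const processBatch = internalMutation({
  args: { cursor: v.optional(v.string()) },
  handler: async ({ db, scheduler }, { cursor }) => {
    const { page, isDone, continueCursor } = await db
      .query("something")
      .paginate({ numItems: 10, cursor: cursor ?? null });

    for (const doc of page) {
      // process doc
    }

    // isDone is the reliable stop condition; continueCursor is always a
    // string, even at the end, so checking its truthiness loops forever
    if (!isDone) {
      await db.patch(someId, { nextCursor: continueCursor });
      await scheduler.runAfter(0, internal.something.processBatch, {
        cursor: continueCursor,
      });
    }
  },
});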
zid (OP) · 13mo ago
Thank you for clarifying!
