Question regarding indexes
I think I'm a bit confused about the underpinnings of how some of this works.
If I'm fetching documents that need to be sorted, I'm wondering whether placing an index on that field lets a more efficient sorting algorithm be used. Moreover, I'm assuming that since the docs will essentially be sorted ahead of time, this will reduce table scans?
In practice, I'm trying to figure out:
note: I understand that Convex has the default creation time property, and thus fields like postDate may be unnecessary, but this is for example's sake
Should I use an index in this situation?
Is this a proper application of an index? No eq, but rather a gte followed by the second half of the composite index that's not explicitly applied
I'm trying to see whether the appropriate index should be
.index("by_districtsId", ["districtsId"])
or
.index("by_districtsId_postDate", ["districtsId", "postDate"])
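For concreteness, here's roughly how those two options would look in a schema (a sketch; the posts table name and field types are just for illustration):

```ts
// convex/schema.ts -- a sketch; table name and field types are illustrative
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  posts: defineTable({
    districtsId: v.string(), // could also be v.id("districts")
    postDate: v.number(),
    points: v.number(),
  })
    // Option 1: documents grouped by districtsId only
    .index("by_districtsId", ["districtsId"])
    // Option 2: within each districtsId, rows are pre-sorted by postDate,
    // so a range like .gte("postDate", ...) is an efficient contiguous scan
    .index("by_districtsId_postDate", ["districtsId", "postDate"]),
});
```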
Have you seen this guide? https://docs.convex.dev/database/indexes/indexes-and-query-perf
Yes I have, reviewing it a second time right now... still not sure of a few things.
For instance, below you'll see I'm fetching using a composite index.
In the range, I'm not specifying anything with regard to points, but I'm assuming that it still sorts by points (postDate first)?
Additionally, the below seems like it's valid, but I want to make sure. From my understanding there is no way to tell from my end whether an index is being applied successfully or being ignored.
Is this a proper application of an index? No eq, but rather a gte followed by the second half of the composite index that's not explicitly applied
> but I'm assuming that it still sorts by points (postDate first)?
Yes, it sorts by postDate first, then points, even if points is not in the filter.
> From my understanding there is no way to tell from my end whether an index is being applied successfully or being ignored.
If you use withIndex, the index is always applied. Convex never ignores indexes (ref https://stack.convex.dev/not-sql).
> Is this a proper application of an index?
Yes, this works, although the second argument to order is ignored. If the index is on (postDate, points), then all results will be in the order of postDate first, then points if postDate is equal.
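For example, something like this (a sketch; the table name and timeFrameStart are placeholders, run inside a query handler):

```ts
// Sketch: index on (postDate, points); withIndex always uses the named
// index -- there's no query planner that might silently ignore it.
const results = await ctx.db
  .query("posts") // hypothetical table name
  .withIndex("by_postDate_points", (q) => q.gte("postDate", timeFrameStart))
  .order("desc") // single argument; orders by the full index key
  .take(10);
// Ordered by postDate descending; points only breaks ties between
// documents whose postDate values are exactly equal.
```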
Ah, thank you very much!
Responding to:
> Yes, it sorts by postDate first, then points, even if points is not in the filter.
I can confirm that postDate is applied correctly, but the sorting seems to stop there.
From the 10 results, it seems to be sorted by recency and ignores points...
Is there a good way to debug this?
Here are a few things I've tried to understand the behavior:
- When using a different index, one that uses points and does not use the .gte("postDate", timeFrameStart), it sorts by points as expected
- When removing points and only filtering by postDate, whether that's year, day, all time, etc., it filters as expected
Can you share some document values for postDate, points, and _creationTime, and the index query you're doing?
The query will only be sorted by points for documents that have exactly equal postDate fields.
One moment
goal: allow users to fetch items from a timeframe (e.g. allTime), sorted by a single field value, points
index:
.index("by_districtsId_status_postDate_points", ["districtsId", "status", "postDate", "points"])
query:
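(The actual snippet isn't shown here; roughly, it's a query along these lines, where the "active" status value and timeFrameStart are placeholders I'm assuming, with timeFrameStart = 0 for allTime:)

```ts
// Sketch reconstructed from the index above; status value and
// timeFrameStart are placeholders.
const results = await ctx.db
  .query("threads") // hypothetical table name
  .withIndex("by_districtsId_status_postDate_points", (q) =>
    q.eq("districtsId", districtsId)
      .eq("status", "active")
      .gte("postDate", timeFrameStart)
  )
  .order("desc")
  .take(10);
```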
returned list of items / document values:
doc1:
- points: 0
- postDate: 1707257107319
doc2:
- points: 0
- postDate: 1707257082693
doc3:
- points: 0
- postDate: 1705711005160
doc4:
- points: 900
- postDate: 1705703265864
Not sure I understand:
> The query will only be sorted by points for documents that have exactly equal postDate fields
ohh
Nevermind, so, essentially, this won't work...
I feel like my use-case is relatively common?
Yeah, this is sorted by postDate
What's a workaround/solution?
I can't sort it by points first, then postDate, given that for postDate what I want is to utilize the .gte.
The only workaround I've made was to create my own sorting algorithm, but this can only be applied to a single page at a time (paginated queries).
There isn't a great solution, since however you slice it you're asking each query to read a lot of data -- there's no way to pre-sort the data so all of the documents you want are next to each other. There are some workarounds if you're able to pre-process the data, like adding a boolean isRecent to each document, defined as isRecent: postDate > timeFrameTimestamp, and then doing an index query for q.eq("isRecent", true).order("desc") with an index on ["isRecent", "points"]. Or similarly you could make a postDay: Math.floor(postDate / MILLISECONDS_PER_DAY) field.
Appreciate the help btw.
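To make the isRecent idea concrete, a sketch (table and index names are hypothetical, and a periodic job still has to flip isRecent as documents age out):

```ts
// Schema excerpt (sketch):
//   .index("by_isRecent_points", ["isRecent", "points"])

// All recent documents, highest points first: within isRecent === true,
// the index keeps rows sorted by points, so this reads a contiguous range.
const top = await ctx.db
  .query("threads") // hypothetical table name
  .withIndex("by_isRecent_points", (q) => q.eq("isRecent", true))
  .order("desc")
  .take(10);
// A cron must maintain the invariant
// isRecent === (postDate > timeFrameTimestamp) as time passes.
```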
I'm at least glad that I get what's happening...
hmm
what you really want is a way to sort your data by some field so the documents you want are contiguous
hmm, dangit, I wonder how Reddit does this.
okay, just to be clear on a few things.
When you say
> however you slice it you're asking each query to read a lot of data
you're not referencing the use of gte on the postDate field, correct? Seems like the index should work fine there. But just in case: you're referring to the non-working theoretical compound query we're discussing?
In case I forget, thank you for your help. Much appreciated. I was stuck on this for a couple hours today lol. Thought I had it working awhile ago, but guess not!
Hmm.. given this predicament, I'm concerned that my filtering and sorting features will be rather limited.
I want the user to be able to fetch the most engaging posts from the past day/week/month/year, but anything this complicated seems out of the question, yes? I mean, I have this logic built out in my app, but again it only applies to whatever is returned from the server first.
Let me look more into the Math.floor(postDate / MILLISECONDS_PER_DAY) approach..
This does seem like a reasonable use-case, so I'll check with colleagues to see if we have a better recommendation. The Math.floor technique could get you there, but it may require merging pagination pages (query top posts from the last 7 days across 7 queries, then sort the pages together).
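As a sketch, assuming a postDay field and an index on ["postDay", "points"] (the names here are hypothetical):

```ts
// Query the top posts for each of the last 7 days, then merge. The global
// top 10 for the week must appear in some day's top 10, so this is exact.
const MILLISECONDS_PER_DAY = 24 * 60 * 60 * 1000;
const today = Math.floor(Date.now() / MILLISECONDS_PER_DAY);

const pages = await Promise.all(
  Array.from({ length: 7 }, (_, i) =>
    ctx.db
      .query("threads") // hypothetical table name
      .withIndex("by_postDay_points", (q) => q.eq("postDay", today - i))
      .order("desc") // highest points first within each day
      .take(10)
  )
);
const topOfWeek = pages
  .flat()
  .sort((a, b) => b.points - a.points)
  .slice(0, 10);
```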
Does something like this work? Or am I running into the same issue..
Same issue. Remember, order only takes one argument.
Got it, and oh yes, thanks
Ok, I think I'm starting to understand more of the limitation here, and it seems to be rather fundamental, at least with respect to how indexes are designed and how they work.
This is an interesting engineering question; curious what workarounds/suggestions, if any, may come from this.
Here are some initial thoughts on a workaround..
One direction I'm thinking about is:
- Add a field to each document, timeFrame, which can be assigned several literal values: 10, 100, 1000, 10000, and null, each representing a timeframe (10 = day, for example).
- Use a cron function that runs each evening to process the top 100+ threads with timeFrame set to null (and that pass a minimum point threshold) within a certain time period (say a month).
- When the end user selects a timeFrame + points, we query with an index like by_timeFrame using gte or lte, and sort by _creationTime/postDate. Although the returned list is not a perfect top-down ordering of points, there's essentially a high-level weighted algorithm happening here that (hopefully) blends in a natural series of results.
yeah, this is an interesting architectural problem in general
almost like an interview question b/c there are tradeoffs
so normally there's thinking around partitioning
warm stories vs. cold ones or whatever
one way to think about it is only threads with changes need to have their summaries re-calculated
a lot of times the sites at scale have this specialization built into them
a classic example is twitter
if a user has few followers, you may denormalize the publishing of their tweets...
if a user has tons of followers, you don't
so in this case if you want a reasonable balance, you might have strategies for threads which are updating frequently vs. threads which are "cold", which most threads usually are
I'd have to think through your proposal here, but the general gist of it (rely on some domain-specific heuristics to achieve what you want) is in spirit what these systems actually do when they're optimizing for scale. like, loosening the requirement so results are mostly right but not absolutely right could greatly simplify your approach here, and it's often how big sites work in practice: slightly delayed values recalculated periodically, prioritizing recalculations for the most active threads or whatever
(and "merge in" popular users on demand, relying on the fact that there are few of them and their records are probably in cache memory b/c they're fetched so frequently)
convex's indexes work the way ~all db indexes work, so the considerations are the same on any stack given your requirements. (in fact, convex's indexes are actually just MySQL indexes)
All of this is super insightful; I'm loving it, thank you!
Looking forward to implementing this tomorrow
Close to a solution I'm happy with, but a bit stuck on something..
With this cron job approach, I will need to process these threads in batches. To simplify this scenario, let's assume that this is simply due to the sheer volume of documents along with the runtime limitations.
One way I think we can do this is to use a simple shared/common timestamp to check whether a thread has been processed for that day. One drawback that comes to mind with this approach is that we have to make sure there are enough invocations of this cron job to process all the threads before the day ends. Technically there is an eventual ceiling to the number of threads that can be processed within a day, which I suppose wouldn't matter significantly if we allow processing to bleed into the next day and are loose with our requirements. My question is: what's a better approach to this? I'd love to hear your thoughts on my approach in general; if it's horrible, I'd be very pleased to know! And thank you in advance.
> we have to make sure there are enough invocations of this cron job to process all the threads before the day ends.
This should not be a problem. If it is, you have a very fundamental limitation on your hands, but I imagine you should be able to handle a ton of scale before you run into this. I'd first try it out and see how much you can process in what amount of time.
You can also process all batches from a single cron execution: it does the first page and uses the scheduler to run itself again with the next page, passing in the pagination cursor. So with a single cron execution you can start a process that works through the whole time period the cron is responsible for. Make sense?
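Roughly like this, as a sketch (function and table names are made up):

```ts
// convex/batch.ts -- sketch of the self-scheduling batch pattern.
import { internalMutation } from "./_generated/server";
import { internal } from "./_generated/api";
import { v } from "convex/values";

export const processBatch = internalMutation({
  args: { cursor: v.union(v.string(), v.null()) },
  handler: async (ctx, { cursor }) => {
    // Read one page; the cron kicks this off with cursor: null.
    const { page, isDone, continueCursor } = await ctx.db
      .query("threads") // hypothetical table name
      .paginate({ numItems: 100, cursor });

    for (const thread of page) {
      // ...process one thread, e.g. recompute its summary fields
    }

    // Schedule the next batch, passing the cursor forward.
    if (!isDone) {
      await ctx.scheduler.runAfter(0, internal.batch.processBatch, {
        cursor: continueCursor,
      });
    }
  },
});
```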
Agreed, thank you Michal!
Hmmm, that sounds quite clever: essentially a single cron execution initiates a chain of processing tasks, each scheduling the next until all necessary data has been processed. As long as I save and fetch the current cursor from the db, it seems like it should work well. Also, looking at how errors are handled, it seems like as long as it's not a developer error, Convex will retry indefinitely. I think I'll have to try this, thank you Ian!
Finally getting a chance to delve into this... a few follow-up questions if I may..
- Just came across "A single function can schedule up to 1000 functions with total argument size of 8MB." Is each recursive invocation its own instance/separate function, or does every recursive call accumulate towards this 1000-function limit? If so, no worries, as this is more of an issue that arrives at scale, as Michal noted, but I would still like to know so I can fine-tune and be aware.
- The following seems okay to me, but do I have the right idea here?
> Is each recursive invocation its own instance/separate function
Yes, so the invocations do not "add up" towards the limit (so more invocations, more data that can be processed). Your code looks directionally good to me!
That's amazing, thank you Michal!
I think I caused an infinite loop... had to cancel the invocation manually.
The basis of my logic was the following...
I'm assuming continueCursor is not a good choice for this?
Looking into this further now..
Hmm, alright, seems like isDone and page.length are the only reliable ways to stop the recursion..
Could you confirm this please?
Also, how does continueCursor continue to have a value? Does it circle back to the first item?
Lastly, I am SO GLAD you guys have a manual kill switch.
isDone (not page.length or continueCursor) is the only way to tell that pagination is done. continueCursor will still have a value, which means "the end", which is distinct from null, which means "the beginning".
Thank you for clarifying!
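So, for the record, the termination check should look like this (a sketch, reusing the hypothetical names from the batch example above):

```ts
const { page, isDone, continueCursor } = await ctx.db
  .query("threads")
  .paginate({ numItems: 100, cursor });

// Wrong: continueCursor is a non-null string even at the end (it means
// "the end", as opposed to null meaning "the beginning"), so checking it
// recurses forever -- likely the infinite loop above.
//   if (continueCursor !== null) { ...schedule next batch... }

// Right: isDone is the only reliable signal that pagination is finished.
if (!isDone) {
  await ctx.scheduler.runAfter(0, internal.batch.processBatch, {
    cursor: continueCursor,
  });
}
```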