mathematicalmichael
mathematicalmichael•2mo ago

vectorSearch filter limitations

Hi! I'm building a multi-user app for image library management and am incorporating vector search for retrieving similar images. Everything was actually quite smooth at first - I defined my images table to have an ownerId and used that same field as a filter field for the vectorIndex. As a basic filter this worked splendidly for access control.
const vectorResults = await ctx.vectorSearch("images", "by_embedding", {
  vector: result.embeddings,
  limit: requestedLimit,
  filter: (q) => q.eq("ownerId", userId),
});
The problem I began to encounter occurred once I began incorporating search into album management.

Goal 1: Search for relevant images within an album.
Goal 2: Find relevant images that are NOT already in this album (to add them in).
Design restriction: Images can belong to multiple albums - that state is tracked in another table, not alongside the images themselves.

Goal 1 is currently implemented by searching across all of the user's images and filtering after the fact by membership in an album. This approach gets starved of results by the 256-result return limit once the album gets large (if relevant images not in the album rank above anything inside it).

Goal 2 is similar: find all of the user's images and filter out the ones currently in the album. This will starve once the album is full of highly ranked images.

I've been thinking about how to get around this:
- Brute force with paginating vectorSearch results
- Returning more than 256 (just kicks the can down the road)
But that's inefficient.

The only solution I thought of involves a lot of data duplication: copying image embeddings multiple times (once per album) and using ownerId:albumId as the filter key / index instead. But that still only really addresses Goal 1 (I'd be able to cope with max 256 results + no starvation), not Goal 2.

Some advice / guidance would be greatly appreciated! I'd really prefer not to move vector search outside Convex!
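To make the starvation problem concrete, here is a minimal TypeScript sketch of the post-filtering step described for Goal 2. The `VectorHit` shape and the `excludeAlbumMembers` name are hypothetical and only stand in for whatever the real search results look like; the point is that filtering happens after the capped result list has already been returned.

```typescript
// Hypothetical shape of one vector search result (id plus similarity score).
interface VectorHit {
  id: string;
  score: number;
}

// Post-filter: drop hits that already belong to the album, then cap the list.
// The input is assumed to already be limited (e.g. to 256 hits) by the search.
function excludeAlbumMembers(
  hits: VectorHit[],
  albumImageIds: Set<string>,
  limit: number,
): VectorHit[] {
  return hits.filter((hit) => !albumImageIds.has(hit.id)).slice(0, limit);
}

// Starvation: if every returned hit is already in the album, nothing survives,
// even though lower-ranked images outside the album may exist.
const hits: VectorHit[] = [
  { id: "img1", score: 0.9 },
  { id: "img2", score: 0.8 },
];
const starved = excludeAlbumMembers(hits, new Set(["img1", "img2"]), 10);
console.log(starved.length); // 0
```

The same function with the condition inverted would cover the Goal 1 post-filter, with the same starvation risk.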
mathematicalmichael
mathematicalmichaelOP•2mo ago
If filter expressions allowed AND + NEQ logic on vector search, then having a separate table with duplicated embeddings + ownerId and albumId could solve both goals. Yes, it's duplicating data... but the filter could be q.and(q.eq("ownerId", userId), q.eq("albumId", currentAlbum)), which would allow me to get up to 256 results for Goal 1. For Goal 2 the filter could be q.and(q.eq("ownerId", userId), q.neq("albumId", currentAlbum)) if NEQ were available.

To avoid duplicating data entirely, I'd need to keep track of which albums each image belongs to in the images table (easy), but then filters would need to be able to check for membership in this array. That feels way more elegant but complicated to implement.
mikeysee
mikeysee•2mo ago
Hey, this is a really interesting problem and I didn't have a good solution, so I asked my buddy O3 and this is what it came up with: https://chatgpt.com/share/685e35fa-e8e8-8006-9042-e284767df905

That "ownerAlbum" composite key is a pretty clever workaround for the lack of "AND", so that should solve your Goal 1. As for Goal 2, it sounds like a batched loop is your best bet: not elegant, but it should work. Maybe someone else will have a better solution?
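A minimal sketch of the composite-key idea, assuming one duplicated embedding row per (image, album) pair as described earlier in the thread. The `ownerAlbum` field name comes from the message above; the helper names, the row shapes, and the ":" separator are hypothetical, and the separator assumes neither ID contains a colon.

```typescript
// Hypothetical source record: one embedding per image.
interface ImageEmbedding {
  imageId: string;
  ownerId: string;
  embedding: number[];
}

// Hypothetical duplicated row: one per (image, album) membership.
interface AlbumEmbeddingRow {
  imageId: string;
  ownerAlbum: string; // single filter field standing in for "ownerId AND albumId"
  embedding: number[];
}

// Build the composite key. Assumes IDs never contain ":".
function ownerAlbumKey(ownerId: string, albumId: string): string {
  return `${ownerId}:${albumId}`;
}

// Expand one image into one row per album it belongs to.
function rowsForAlbums(image: ImageEmbedding, albumIds: string[]): AlbumEmbeddingRow[] {
  return albumIds.map((albumId) => ({
    imageId: image.imageId,
    ownerAlbum: ownerAlbumKey(image.ownerId, albumId),
    embedding: image.embedding,
  }));
}

// A Goal 1 search would then use a single equality filter, along the lines of:
//   filter: (q) => q.eq("ownerAlbum", ownerAlbumKey(userId, currentAlbum))
```

This only covers searching inside an album. It does not help with "not in this album", because that would need a negated filter.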
mathematicalmichael
mathematicalmichaelOP•2mo ago
appreciate that! i did do quite a bit of debugging with AI before coming here. the composite key is a weird workaround and limiting. here's what I've done in the interim to prevent starvation: ...

I just wrote my own convex server function (action) to perform vector search for me. the records are filtered ahead of time so that only the right ones are scored, so there's no starvation. my vectors are normalized, so i implemented dot product to save on extra computations. I kept convex vector search in place for one of my search bars, but for the albums issue that was blocked, I use my own version. unblocked.

In theory: if I impose a 5k max images-per-album limit and a 10k max images limit, that should keep things small enough to be responsive with the server function and not run out of memory (fwiw - self-hosting convex). convex is the store for embeddings, and lets me craft the query I need in non-vector-search mode (normal db sql) in order to perform custom vector search the way i need it.

how far this solution will scale as a server function is unclear - but i think that if i stay well below 100k records for NN computation, i should be fine. currently (for volumes in the low hundreds, which will suit most users), custom vector search feels just as fast as convex vector search.
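A minimal sketch of what such a pre-filtered, brute-force search could look like. This is not the poster's actual code: `Candidate`, `dot`, and `topKByDotProduct` are hypothetical names, and the candidates are assumed to have been selected beforehand with an ordinary database query (for example, the images in one album). For unit-length vectors, the dot product gives the same ranking as cosine similarity, which is why normalization lets the norm computation be skipped.

```typescript
// Hypothetical pre-filtered record: only rows that passed the album/owner filter.
interface Candidate {
  id: string;
  embedding: number[];
}

interface Scored {
  id: string;
  score: number;
}

// Dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Score every candidate against the query and keep the k best (highest score first).
// This is a full scan, so cost grows linearly with the number of candidates.
function topKByDotProduct(query: number[], candidates: Candidate[], k: number): Scored[] {
  return candidates
    .map((c) => ({ id: c.id, score: dot(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Because every candidate is read and scored, the caps on album and library size mentioned above are what keep this approach responsive.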
mikeysee
mikeysee•2mo ago
Awesome! Sounds like a really smart workaround, plus adding some practical limits for your table. I think this is a really good use-case example that I could do a video on at some point, as it's one of those where the problem is clear but the solution really isn't.
mathematicalmichael
mathematicalmichaelOP•2mo ago
that'd be awesome. is the filter logic limitation due to the implementation of vector search? i've been testing my solution, and while it definitely avoids starvation, it starts to feel slower than convex vector search when the set to search over grows to about 1k. in other words, i'm very much interested in contributing to the technical discussion over how to achieve these goals natively in convex. afaik chroma can handle AND filters, so i believe it's possible, as well as negations - which would solve starvation for both goals. i'd just rather not add another tool if i can do it natively.
mikeysee
mikeysee•2mo ago
Yeah, sorry, we are reaching the limit of my knowledge about the internal implementation of the vector search indices / filters. Hmm, who would be best to help on this one, do you think, @ian? The TLDR is that @mathematicalmichael is reaching the limits of what can be done with vector search filters and is wondering whether there is some technical limitation there, or whether it simply hasn't been requested up to now and thus hasn't been implemented.
ian
ian•2mo ago
I would avoid doing the comparison on more than 1k in a single query - if you have embeddings of length 1536, then that's already 12MB of bandwidth reading, and the limit iirc is 16MB?

Yes, AND filters aren't supported because they're hard to do efficiently in pre-filtering. Also, "neq" I believe is doing a huge scan and post-filtering, which can result in graph breakdown, at least in qdrant last I was looking into the technicals.

Doing the search with an index makes sense to me, and dot product should be good. Also, limiting the size of the vectors may give you a big lift - e.g. for the OpenAI ones they claim you can use a prefix and get reasonable results - e.g. just use the first 256, though that's for the text embedding model.

One thing to consider is how big each album is. If the album is <10% of the results, then you can do a search across everything, then post-filter yourself to exclude any in the current album. In degenerate cases you might end up with no results, but it will likely be faster and use less bandwidth (esp. if you store the embedding ID on another table with an index to avoid reading the embedding when you fetch the contents).
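The bandwidth estimate and the prefix-truncation idea above can be sketched in a few lines of TypeScript. Both function names are hypothetical. The 8 bytes per dimension is an assumption (64-bit floats), chosen because it reproduces the roughly 12MB figure quoted above; re-normalizing after truncation is also an assumption, made so that dot-product scoring on unit vectors keeps working.

```typescript
// Rough bytes read when scanning `count` embeddings of `dimensions` floats each.
// Assumes 8 bytes per float (64-bit); adjust if the storage format differs.
function embeddingReadBytes(count: number, dimensions: number, bytesPerFloat = 8): number {
  return count * dimensions * bytesPerFloat;
}

// Keep only the first `dims` components and rescale to unit length.
// Only meaningful for models trained to support prefix truncation.
function truncateAndNormalize(vector: number[], dims: number): number[] {
  const prefix = vector.slice(0, dims);
  const norm = Math.hypot(...prefix);
  if (norm === 0) return prefix; // avoid dividing by zero for an all-zero prefix
  return prefix.map((x) => x / norm);
}

console.log(embeddingReadBytes(1000, 1536)); // 12288000 bytes, about 12.3 MB
console.log(embeddingReadBytes(1000, 256)); // 2048000 bytes, about 2 MB
```

Under the same assumption, 1k embeddings of length 768 come to about 6.1 MB, so halving the dimension halves the read cost of a brute-force scan.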
mathematicalmichael
mathematicalmichaelOP•2mo ago
thanks @ian! i'm using self-hosted embeddings of size 768, which are indeed also matryoshka-capable (truncation). in practice, i've found the retrievals to suffer considerably even at 512. at 768 (full), the text-to-image retrieval i'm seeing is really quite good. i embed both text and images.

i definitely should move embeddings into their own table... the more i think about it, the additional cleanup logic is worthwhile.

but back to the question of how to handle retrievals. i do want to avoid starvation, even in degenerate cases, and post-filtering can really hamper that. i'm wondering why AND filters are a limitation. iirc chromadb supports them.

i appreciate your help as i try to understand the limits of my implementation and the options available. the end goal is to be able to search across huge libraries and manage them into organized collections with ease. i think what makes it complicated is that i'm trying to avoid duplicating data, as images can belong to many collections (and i need to filter on owner), so an indexing strategy that works has evaded me so far. i'd really love to avoid rolling another microservice to handle vector db search and syncing state between it and convex.
ian
ian•2mo ago
Doing AND in pre-filter is an open research question. qdrant "supports" them but is doing it in post-filtering. I don't know how chromadb does it but I suspect it's leaning heavily on post-filtering too.
ian
ian•2mo ago
Have you seen what I did with the RAG component? https://github.com/get-convex/memory/blob/main/README.md#L153-L221
ian
ian•2mo ago
Getting an image to show up under many collections is really tricky unfortunately - I can't think of a clean way to do it atm
ian
ian•2mo ago
Here's what some deep research came up with - it seems like chroma does similar to what you are doing - do an index scan, then compare results within there - but they don't have indexes so they do roughly a table scan, unless things have changed
mathematicalmichael
mathematicalmichaelOP•2mo ago
after some research it seems that metadata as indices is still a WIP, and people get awful performance when trying to use it.

i did take a look at RAG, yes. the namespaces could act like user boundaries? the multi-indices got me thinking, but the component itself didn't seem like something i could use. please correct me if i'm not seeing something.

shoot. this is a fundamental issue for me, and i've been thinking about the db design for days (including back and forth with ai). given that sql can filter like this without issue, it feels so viable. i don't know what makes the vector part so different in kind.

my current implementation works, but to your point it hits memory limits in convex after a few thousand. even when self-hosted, i can't modify that without a custom build, can i? i read that convex can handle millions of vectors - is it paginating or something in order to do the distance calculations?

to add extra complexity, my app has versioned albums, so i really hesitate to create entries for each revision. wrote up my design criteria here: https://2025.mpilosov.com/7/11/

according to that deep research, qdrant is doing something much fancier than post-filtering wrt AND queries. i'm wondering - if it's supported, is there a path to allowing it via convex (i'm assuming this is what it uses)?
ian
ian•2mo ago
The difficulty comes in the modeling of the HNSW - it's a cool data structure that might help inform some of this. The support for millions of vectors is not a scan of a regular table - it's using that data structure, which allows searching in a specific region, so you only see results near the vector - no need to read the other rows.

I had an idea last night though - and also I don't actually understand why the RAG component wouldn't work here. You have one namespace per album, and a separate per-user namespace. When you add an image, you add it to all the namespaces (over time I could adjust the API to make multi-namespace insertions more efficient). To search within an album, you search that namespace. To search for images not in the album, you do a global search and post-filter out the ones already in the album. The degenerate case is if one album is 90% of the images and the results only have 10% representation of non-related ones. But from a user POV, how many do they really need to see? And once you have some results, you could do other searches within those albums to show more, since images are probably clumped per album.

Another idea is to do a vector search for albums - have an embedding that represents the centroid of the images in it - then do a per-album search.

If you do some napkin math you can see if this would have any real impact on cost. Compared to image storage, serving & embedding I suspect it won't, but I haven't done the math.
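A minimal sketch of the album-centroid idea, assuming the centroid is the component-wise mean of the album's image embeddings. The `centroid` function name is hypothetical, and re-normalizing the mean to unit length is an added assumption so that the result can be compared with dot product like the other (normalized) vectors in this thread.

```typescript
// Component-wise mean of a set of equal-length embeddings, rescaled to unit length.
function centroid(embeddings: number[][]): number[] {
  if (embeddings.length === 0) {
    throw new Error("cannot compute the centroid of an empty album");
  }
  const dims = embeddings[0].length;
  const sum = new Array<number>(dims).fill(0);
  for (const embedding of embeddings) {
    if (embedding.length !== dims) {
      throw new Error(`dimension mismatch: expected ${dims}, got ${embedding.length}`);
    }
    for (let i = 0; i < dims; i++) {
      sum[i] += embedding[i];
    }
  }
  const mean = sum.map((x) => x / embeddings.length);
  // Assumption: normalize so the centroid is comparable via dot product.
  const norm = Math.hypot(...mean);
  return norm === 0 ? mean : mean.map((x) => x / norm);
}
```

One centroid per album would then be stored and searched first to pick candidate albums, followed by a per-album search. It would need recomputing whenever the album's contents change.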
mathematicalmichael
mathematicalmichaelOP•2mo ago
i really appreciate that feedback. perhaps i'm not fully understanding the extensibility of the rag component, but i don't see how to insert my own vectors. the issue is, there's one model for images -> vector, forming the db, and a separate model for inference with text -> vector (both self-hosted endpoints; neither follows the openai embedding spec, but i can make them do so if it helps at all).

i'm wondering if it's almost worth treating all uploads as having to belong to an album - i.e. "unsorted" or "uploads" - and treating "search for images not in album" as "search within uploads". this could simplify things, especially if I cap album size.

i think the issue i run into no matter what is that post-filtering means i can at best add the top 256 results for a given query. to find more relevant images to add, the user will have to modify the query in some way, or simply trim down the uploads album. oh the joy of designing ux around technical constraints 🫠

i already handle redundant addition of images gracefully, so i'm considering just removing the "search for not in album" feature entirely at this point... creating only one way to grow an album: via global search, where convex search works out of the box, and text queries can be made as specific as they need to be.

again - really appreciate your input on this topic.
ian
ian•2mo ago
You can pass in an array of objects to chunks that has an embedding set on each chunk. But not necessary to use it in this case - likely overkill since each would just have one entry and you're not looking to look them up as context. Good luck with it all! Having a no-album album search sounds useful - then as they add photos to their album, the results change b/c those aren't in the no-album results anymore, then a separate query can look in other albums. But yeah if you can cut scope overall and do global search you can always come back to more complex things based on real user behavior / needs
mathematicalmichael
mathematicalmichaelOP•2mo ago
thank you! it appears qdrant is used under the hood by convex - does that mean that more advanced filtering could be enabled in theory?
ian
ian•2mo ago
In theory, yes! It's not on the short term roadmap but lots of possibilities
