Jacob Kim · 6mo ago

For languages other than English, Korean

For languages other than English (Korean in my case), fuzzy search is not working. Prefix search works. Does anyone else have this issue?
sujayakar · 6mo ago
hey @Jacob Kim -- yeah, we haven't tested fuzzy matching much for non-English languages yet. can you share more on the behavior you're seeing? specifically, what your query string is and which documents you think should match but don't? (as a non-Korean speaker, this would be super helpful.) I'm also curious at a higher level how typo tolerance should work in Korean: is each wrong character considered one typo? or, since hangul syllables decompose into letters (jamo), is each wrong letter a typo?
Jacob Kim (OP) · 6mo ago
so here is an example with the Korean string "가나다라":
- searching for "가" (the first character) correctly performs a prefix search.
- searching for "나" or "다" (the 2nd and 3rd characters) returns nothing.
Not sure how typo tolerance works with Korean either. I would really love to be able to just run String.prototype.includes() with the search index. Don't even need fancy fuzzy search and typo tolerance.
sujayakar · 6mo ago
ah, yeah, this is expected -- even when typo tolerance is working correctly, it only allows one typo for words between 5 and 8 characters and two typos for words over 8 characters. for your use case, do you know which query strings you would want to use with String.prototype.includes()? would it always be a single character (like "가", "나", or "다")?
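(A minimal sketch of that rule in TypeScript -- this is just the policy as described above, not Convex's actual implementation, and the function name is made up for illustration:)

// allowed typo budget by query word length, per the rule described above.
function maxTypos(word: string): number {
  if (word.length > 8) return 2;  // words over 8 characters: two typos
  if (word.length >= 5) return 1; // words of 5-8 characters: one typo
  return 0;                       // shorter words: exact/prefix matches only
}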
Jacob Kim (OP) · 6mo ago
sorry for the late response @sujayakar. My query string won't always be a single character. I tested, and it seems like the query starts working once its length reaches 5. For instance, with the actual string 가나다라마바:

The following work because of prefix matching:
- 가나
- 가나다...

The following do not work (these are substrings starting from index 1):
- 나다
- 나다라
- 나다라마

This starts working:
- 나다라마바

Since convex does not have LIKE %STR% like regular SQL, is the search index the only option for text search? Should I use another DB if I am using a non-English language?
sujayakar · 6mo ago
yeah, that makes sense. since 나다라마바 has length 5, it permits one typo, so inserting a character at the beginning counts as one "typo" and it matches. but note that it won't match with more than one insertion.

also note that in most SQL databases, LIKE %STR% can't use an index and just scans over all of the rows. you can replicate this behavior in convex by filtering within javascript:
const searchString = '나다라마바';
// scan every document in the "messages" table and filter in JS.
for await (const message of ctx.db.query("messages")) {
  if (message.text.includes(searchString)) {
    console.log(`${message._id} matches!`);
  }
}
this will read all of the rows from the table, but then you can do whatever filtering logic you'd like in JS. there's a more advanced indexing approach that uses a "trigram index" (postgres has this as an extension: https://www.postgresql.org/docs/current/pgtrgm.html) that we could build on top of text search -- let me know if you're interested in exploring that.
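(For reference, here's a minimal sketch of that scan wrapped in a complete Convex query function. The file path, the "messages" table, and its text field are assumptions carried over from the example above:)

// convex/messages.ts (hypothetical file)
import { query } from "./_generated/server";
import { v } from "convex/values";

export const substringSearch = query({
  args: { searchString: v.string() },
  handler: async (ctx, { searchString }) => {
    const matches = [];
    // full table scan: O(n) in table size, fine for small tables.
    for await (const message of ctx.db.query("messages")) {
      if (message.text.includes(searchString)) {
        matches.push(message);
      }
    }
    return matches;
  },
});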
Jacob Kim (OP) · 6mo ago
Ah of course, I always forget that SQL isn't magic.
1. If I do the filtering logic in JS, I assume it will cause a higher "action compute" charge?
2. I didn't know about this extension, so thanks for letting me know. It seems like pg_trgm only supports alphabetic languages (no Korean, Japanese, Chinese, etc.), but pg_bigm seems to support Korean as well.
I guess my current best bet is to implement an efficient text search function in JS.
sujayakar · 6mo ago
you'd do (1) in a query, so it'll just count as a function call + the database bandwidth to read the records. then, you can call that query from an action with ctx.runQuery if needed.

ah, that's really interesting for (2). in english we use trigrams since there are too few bigrams (26^2 = 676) and too many quadgrams (26^4 = 456,976) -- see https://swtch.com/~rsc/regexp/regexp4.html#regexp for more details. but korean looks like it has ~11,000 distinct syllables, so there'd be ~100 million possible bigrams? I haven't used something like pg_bigm in practice, but I'm curious how well it performs on large datasets in korean.
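(To make the n-gram idea concrete, here's a small sketch of bigram extraction, the building block of the pg_bigm approach: index every consecutive pair of characters, and a substring query then only needs to consider documents containing all of the query's own bigrams. The function name is made up for illustration:)

// extract every consecutive character pair (bigram) from a string.
function bigrams(s: string): string[] {
  const chars = Array.from(s); // iterate by code point, not UTF-16 unit
  const out: string[] = [];
  for (let i = 0; i + 1 < chars.length; i++) {
    out.push(chars[i] + chars[i + 1]);
  }
  return out;
}

// bigrams("가나다라마바") => ["가나", "나다", "다라", "라마", "마바"]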
