Jacob Kim · 6mo ago

For languages other than English, Korean

For languages other than English (Korean in my case), fuzzy search is not working. Prefix search works. Does anyone else have this issue?
sujayakar · 6mo ago
hey @Jacob Kim -- yeah, we haven't tested fuzzy matching much for non-English languages yet. can you share more on the behavior you're seeing? specifically, what your query string is and which documents you think should match but don't? (as a non-Korean speaker, this would be super helpful.) I'm also curious at a higher level how typo tolerance should work in Korean: is each wrong character considered one typo? or, since hangul syllables decompose into letters (jamo), is each wrong letter a typo?
Jacob Kim (OP) · 6mo ago
so here is an example with the Korean string "가나다라":
- searching for "가" (the first character) correctly performs a prefix search.
- searching for "나" or "다" (the 2nd and 3rd characters) returns nothing.
Not sure how typo tolerance works with Korean either. I would really love to be able to just run String.prototype.includes() with the search index. Don't even need fancy fuzzy search and typo tolerance.
sujayakar · 6mo ago
ah, yeah, this is expected -- even when typo tolerance is working correctly, it only allows one typo for words between 5 and 8 characters and two typos for words over 8 characters. for your use case, do you know which query strings you would want to use with String.prototype.includes()? would it always be a single character (like "가", "나", or "다")?
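(A minimal sketch of that rule in TypeScript -- this is just the policy as described above, not Convex's actual implementation, and the function name is made up for illustration:)

// allowed typo budget by query word length, per the rule described above.
function maxTypos(word: string): number {
  if (word.length > 8) return 2;  // words over 8 characters: two typos
  if (word.length >= 5) return 1; // words of 5-8 characters: one typo
  return 0;                       // shorter words: exact/prefix matches only
}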
Jacob Kim (OP) · 6mo ago
sorry for the late response @sujayakar. My query string won't always be a single character. I tested, and it seems like the query starts working once its length reaches 5. For instance, with the actual string 가나다라마바:

The following work because of prefix matching:
- 가나
- 가나다...

The following do not work (these are substrings starting from index 1):
- 나다
- 나다라
- 나다라마

This starts working:
- 나다라마바

Since convex does not have LIKE %STR% like regular SQL, is the search index the only option for text search? Should I use another DB if I am using a non-English language?
sujayakar · 6mo ago
yeah, that makes sense. since 나다라마바 has length 5, it permits one typo, so inserting a character at the beginning counts as one "typo" and it matches. but note that it won't match with more than one insertion.

also note that in most SQL databases, LIKE %STR% can't use an index and just scans over all of the rows. you can replicate this behavior in convex by filtering within javascript:
const searchString = '나다라마바';
// scan every document in the "messages" table and filter in JS.
for await (const message of ctx.db.query("messages")) {
  if (message.text.includes(searchString)) {
    console.log(`${message._id} matches!`);
  }
}
this will read all of the rows from the table, but then you can do whatever filtering logic you'd like in JS. there's a more advanced indexing approach that uses a "trigram index" (postgres has this as an extension: https://www.postgresql.org/docs/current/pgtrgm.html) that we could build on top of text search -- let me know if you're interested in exploring that.
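(For reference, here's a minimal sketch of that scan wrapped in a complete Convex query function. The file path, the "messages" table, and its text field are assumptions carried over from the example above:)

// convex/messages.ts (hypothetical file)
import { query } from "./_generated/server";
import { v } from "convex/values";

export const substringSearch = query({
  args: { searchString: v.string() },
  handler: async (ctx, { searchString }) => {
    const matches = [];
    // full table scan: O(n) in table size, fine for small tables.
    for await (const message of ctx.db.query("messages")) {
      if (message.text.includes(searchString)) {
        matches.push(message);
      }
    }
    return matches;
  },
});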
Jacob Kim (OP) · 6mo ago
Ah of course, I always forget that SQL isn't magic.
1. If I do the filtering logic in JS, I assume it will cause a higher "action compute" charge?
2. I didn't know about this extension, so thanks for letting me know. It seems like pg_trgm only supports alphabetic languages (no Korean, Japanese, Chinese, etc.), but pg_bigm seems to support Korean as well.
I guess my current best bet is to implement an efficient text search function in JS.
sujayakar · 6mo ago
you'd do (1) in a query, so it'll just count as a function call + the database bandwidth to read the records. then, you can call that query from an action with ctx.runQuery if needed.

ah, that's really interesting for (2). in english we use trigrams since there are too few bigrams (26^2 = 676) and too many quadgrams (26^4 = 456,976) -- see https://swtch.com/~rsc/regexp/regexp4.html#regexp for more details. but korean looks like it has ~11,000 distinct syllables, so there'd be ~100 million possible bigrams? I haven't used something like pg_bigm in practice, but I'm curious how well it performs on large datasets in korean.
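(To make the n-gram idea concrete, here's a small sketch of bigram extraction, the building block of the pg_bigm approach: index every consecutive pair of characters, and a substring query then only needs to consider documents containing all of the query's own bigrams. The function name is made up for illustration:)

// extract every consecutive character pair (bigram) from a string.
function bigrams(s: string): string[] {
  const chars = Array.from(s); // iterate by code point, not UTF-16 unit
  const out: string[] = [];
  for (let i = 0; i + 1 < chars.length; i++) {
    out.push(chars[i] + chars[i + 1]);
  }
  return out;
}

// bigrams("가나다라마바") => ["가나", "나다", "다라", "라마", "마바"]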
