For languages other than English, Korean
For languages other than English, Korean in my case, fuzzy search is not working, while prefix search is. Does anyone else have this issue?
7 Replies
hey @Jacob Kim -- yeah we haven't tested fuzzy matching much for non-english languages yet.
can you share more on the behavior you're seeing? specifically, what your query string is and which documents you think should match but don't? (as a non-korean speaker this would be super helpful.)
I'm also curious at a higher level how typo tolerance works in korean. is each wrong syllable considered one typo? or, since hangul syllables decompose into letters, is each wrong letter a typo?
so here is an example Korean string: "가나다라"
- If I search for "가" (the first character), it correctly performs a prefix search.
- If I search for "나" or "다" (the 2nd and 3rd characters), nothing is returned.
Not sure how typo tolerance works with Korean either. I would really love to be able to just run String.prototype.includes() against the search index. I don't even need fancy fuzzy search or typo tolerance.
ah, yeah this is expected -- even when typo tolerance is working correctly, it only allows one typo for words between 5 and 8 characters and two typos for words over 8 characters.
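As a plain-JS sketch of that rule (the thresholds come from the message above; the helper name is made up):

```javascript
// Number of typos tolerated as a function of query-word length,
// per the thresholds described above (hypothetical helper name).
// Note: .length counts UTF-16 code units; precomposed Hangul
// syllables are in the BMP, so each syllable counts as 1.
function allowedTypos(word) {
  const len = word.length;
  if (len > 8) return 2;   // words over 8 characters: two typos
  if (len >= 5) return 1;  // words between 5 and 8 characters: one typo
  return 0;                // shorter words: no typos tolerated
}

console.log(allowedTypos("가나"));         // 0
console.log(allowedTypos("나다라마바"));   // 1
console.log(allowedTypos("가나다라마바가나다")); // 2
```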
for your use case, do you know which query strings you would want to use with String.prototype.includes()? would it always be a single character (like "가", "나", or "다")?
sorry for the late response @sujayakar .
my query string won't always be a single character. I tested, and it seems like the query starts working once its length reaches 5.
For instance:
actual string: 가나다라마바
The following works because of prefix matching:
- 가
- 가나
- 가나다...
The following do not work (substrings starting from index 1):
- 나
- 나다
- 나다라
- 나다라마
This starts working:
- 나다라마바
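The pattern in these lists can be reproduced with plain string methods -- the index's prefix matching behaves like startsWith, while the queries above would need includes (a plain-JS illustration, not Convex's actual matcher):

```javascript
const doc = "가나다라마바";

// Queries from the lists above.
const works = ["가", "가나", "가나다"];   // matched by prefix search
const fails = ["나", "나다", "나다라마"]; // substrings starting at index 1

// Prefix search behaves like startsWith: only the first group matches.
console.log(works.every((q) => doc.startsWith(q))); // true
console.log(fails.some((q) => doc.startsWith(q)));  // false

// String.prototype.includes would match all of them.
console.log([...works, ...fails].every((q) => doc.includes(q))); // true
```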
Since convex does not have LIKE '%STR%' like regular SQL, is the search index the only option for text search? Should I use another DB if I am using a non-English language?
yeah, that makes sense. since 나다라마바 has length 5, it permits one typo, so inserting 가 at the beginning is counted as a "typo" and matches. but note that it won't match more than one insertion.
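To see why the length-5 query fits the one-typo budget, here is a standard Levenshtein edit-distance sketch: the leading 가 costs exactly one insertion. (This illustrates edit distance in general, not Convex's actual matcher.)

```javascript
// Classic dynamic-programming Levenshtein distance.
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

console.log(levenshtein("가나다라마바", "나다라마바")); // 1: within the one-typo budget
console.log(levenshtein("가나다라마바", "다라마바"));   // 2: over the budget for a short query
```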
note that in most SQL databases, LIKE '%STR%' can't use an index and just scans over all of the rows. you can replicate this behavior in convex by filtering within javascript.
this will read all of the rows from the table, but then you can do whatever filtering logic you'd like in JS.
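For example, the filtering step is plain JS once the rows are in memory; in a Convex query the rows array would come from something like ctx.db.query("posts").collect() (table and field names here are hypothetical):

```javascript
// Stand-in for rows read from the database; in a Convex query function
// this array would come from ctx.db.query("posts").collect().
const rows = [
  { title: "가나다라마바" },
  { title: "마바사아" },
  { title: "hello world" },
];

// LIKE '%STR%' equivalent: a full scan plus a substring test, no index needed.
function searchBySubstring(rows, needle) {
  return rows.filter((row) => row.title.includes(needle));
}

console.log(searchBySubstring(rows, "나다")); // [{ title: "가나다라마바" }]
```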
there's a more advanced indexing approach that uses a "trigram index" (postgres has this as an extension: https://www.postgresql.org/docs/current/pgtrgm.html) that we could build on top of text search -- let me know if you're interested in exploring that.
Ah of course, I always forget that SQL isn't magic.
1. If I do the filtering logic in JS, I assume it will cause a higher "action compute" charge?
2. I didn't know about this extension, so thanks for letting me know. It seems like pg_trgm only supports alphabetic languages (no Korean, Japanese, Chinese, etc.), but pg_bigm seems to support Korean as well.
I guess my current best bet is to implement an efficient text search function in JS.
you'd do (1) in a query, so it'll just count as a function call + the database bandwidth to read the records. then, you can call that query from an action with ctx.runQuery if needed.
ah that's really interesting for (2). in english we use trigrams since there are too few bigrams (26^2 = 676) and too many quadgrams (26^4 = 456,976) -- see https://swtch.com/~rsc/regexp/regexp4.html#regexp for more details.
but in korean it looks like there are ~11,000 distinct syllables, so there'd be ~100 million possible bigrams? I haven't used something like pg_bigm in practice, but I'm curious how well it performs at large dataset sizes in korean.
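For reference, a bigram index along the lines of pg_bigm boils down to extracting overlapping two-character grams and intersecting their posting lists -- a minimal in-memory sketch, not pg_bigm's actual implementation:

```javascript
// Extract overlapping character bigrams from a string.
function bigrams(s) {
  const grams = [];
  for (let i = 0; i + 1 < s.length; i++) grams.push(s.slice(i, i + 2));
  return grams;
}

// Build an inverted index: bigram -> set of document ids.
function buildIndex(docs) {
  const index = new Map();
  docs.forEach((text, id) => {
    for (const g of bigrams(text)) {
      if (!index.has(g)) index.set(g, new Set());
      index.get(g).add(id);
    }
  });
  return index;
}

// Candidate lookup: ids of docs containing every bigram of the query.
// (Single-character queries would still need a different strategy.)
function lookup(index, query) {
  let result = null;
  for (const g of bigrams(query)) {
    const ids = index.get(g) ?? new Set();
    result = result === null ? new Set(ids) : new Set([...result].filter((x) => ids.has(x)));
  }
  return result ?? new Set();
}

const index = buildIndex(["가나다라마바", "다라마바사"]);
console.log([...lookup(index, "나다라")]); // [0]: only doc 0 has both 나다 and 다라
```

Substring matches anywhere in the document become set intersections over small posting lists, which is why this can use an index where LIKE '%STR%' cannot.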