Oh is it literally another db table? The codebase is large, but (at least for postgres) it looks like the indexes use something that implements the Persistence trait, which the postgres client does.
yeah, exactly!
we encode values being indexed into byte strings that respect sort order (https://github.com/get-convex/convex-backend/blob/main/crates/value/src/sorting.rs#L1) and then use postgres/mysql/sqlite as a sorted key value store: https://github.com/get-convex/convex-backend/blob/dae54b500b6fc02455a314a33f921dc1b93281ed/crates/postgres/src/lib.rs#L703
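for illustration, here's a minimal sketch of the order-preserving idea (hypothetical code, not the actual encoding in sorting.rs, which covers all Convex value types and escaping):

```rust
// Hypothetical sketch: encode a (u64, string) key so that comparing the
// byte strings lexicographically gives the same order as comparing the
// original tuples. The real encoding lives in crates/value/src/sorting.rs.
fn encode_key(n: u64, s: &str) -> Vec<u8> {
    let mut out = Vec::new();
    // Big-endian integers sort the same way as their raw bytes.
    out.extend_from_slice(&n.to_be_bytes());
    // Terminate the string so a following field can't bleed into it
    // (assumes no interior NULs; a real encoding would escape them).
    out.extend_from_slice(s.as_bytes());
    out.push(0);
    out
}

fn main() {
    let a = encode_key(1, "a");
    let b = encode_key(1, "ab");
    let c = encode_key(2, "a");
    assert!(a < b && b < c); // byte order == tuple order
}
```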
what's the reason for using relational databases instead of kv stores for the storage backend? if everything is a byte string, and the source of truth is a LOG, wouldn't using an LSM-backed storage engine be more efficient than a b+tree-backed one?
yeah it's a good question!
honestly, it's just convenience -- we made a simple `Persistence` trait that could be implemented by either an LSM or a BTree, and we use postgres since managed services like RDS are widely available.
the LSM vs. BTree decision is pretty interesting. one peculiarity of our workload for querying indexes is that we need to support multiversioning. so, we want to be able to query a snapshot of the index at slightly stale timestamps.
one way to implement this with a regular sorted map is by storing (key, ts, value) and using a "loose index scan" to find the most recent revision per key that's less than or equal to the snapshot timestamp.
this then shifts the workload towards more range scans, which are (often) better on btrees.
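for example, sketched with an in-memory BTreeMap standing in for the sorted store (hypothetical names, not the actual code):

```rust
use std::collections::BTreeMap;

// Hypothetical sketch: a multiversion index stored as a sorted map over
// (key, ts). A snapshot read is the newest revision with ts <= snapshot_ts.
fn snapshot_read<'a>(
    index: &'a BTreeMap<(String, u64), String>,
    key: &str,
    snapshot_ts: u64,
) -> Option<&'a String> {
    index
        .range((key.to_string(), 0)..=(key.to_string(), snapshot_ts))
        .next_back() // last revision at or before the snapshot
        .map(|(_, v)| v)
}

fn main() {
    let mut index = BTreeMap::new();
    index.insert(("doc1".to_string(), 10), "v1".to_string());
    index.insert(("doc1".to_string(), 20), "v2".to_string());
    assert_eq!(snapshot_read(&index, "doc1", 15), Some(&"v1".to_string()));
    assert_eq!(snapshot_read(&index, "doc1", 25), Some(&"v2".to_string()));
    assert_eq!(snapshot_read(&index, "doc1", 5), None); // before first write
}
```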
but, all of this is to say we just chose it out of convenience and want to maintain a strong abstraction at `Persistence` so we can swap it out later.

thanks for answering my questions. I find this stuff super interesting and I'm currently doing database research at uni.
> we want to be able to query a snapshot of the index at slightly stale timestamps. this then shifts the workload towards more range scans, which are (often) better on btrees.

for the LSM case, you could have a key of `key;<MAX_TS - t>` so that keys are sorted most recent first
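rough sketch of that layout (hypothetical helper; assumes big-endian encoding and that the separator never appears inside keys):

```rust
// Hypothetical sketch of the key;<MAX_TS - t> layout: for a fixed key,
// newer revisions sort first, so the snapshot read becomes a single
// forward seek to the first entry >= lsm_key(key, snapshot_ts).
fn lsm_key(key: &[u8], ts: u64) -> Vec<u8> {
    let mut out = key.to_vec();
    out.push(b';'); // separator; assumes it never appears inside keys
    out.extend_from_slice(&(u64::MAX - ts).to_be_bytes());
    out
}

fn main() {
    // ts = 20 encodes *smaller* than ts = 10, i.e. most recent first.
    assert!(lsm_key(b"doc1", 20) < lsm_key(b"doc1", 10));
}
```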
I did some rough IO estimates for `FIND document WHERE id = <id> AND timestamp <= t` (please correct me if I got anything wrong). it looks like they have similar best-case scenarios, but the LSM struggles when L0 is tiered. if it were leveled, I think the IO numbers could be pretty similar.
key size: 64 bytes
value size: 448 bytes
entry size: 512 bytes
OS page size: 4096 bytes
1 billion entries = 512 GB total storage
query: read document/row at timestamp
lsm (e.g. RocksDB):
- size ratio (T) = 10
- L0 tiered, L1.. leveled
- L0 = 10 MB
- entries per page = 4096 bytes / 512 bytes = 16
- 1 MB skiplist as the memory buffer
- bloom filters with 1% false positive rate
- fence pointers on each level/SSTable
best case
1. search memory buffer(s): 0 IOs
2. L0 search: 1 IO
= 1 IO
worst case
1. search memory buffer(s): 0 IOs
2. L0: 1 IO for each of the 10 SSTables (they have overlapping key ranges) = 10 IOs
3. L1-L5: 1 IO per level (non-overlapping key ranges) = 5 IOs, fence pointers cached in memory
= 15 IOs
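(sanity-checking the level count under those assumptions -- hypothetical back-of-the-envelope code:)

```rust
// Back-of-the-envelope check of the LSM worst case above: with L0 = 10 MB
// and size ratio T = 10, five leveled levels (L1..L5) cover 512 GB.
fn main() {
    let total: u64 = 512 * 1024 * 1024 * 1024; // 1 B entries * 512 B each
    let mut level: u64 = 10 * 1024 * 1024; // L0 = 10 MB
    let mut levels = 0;
    let mut capacity: u64 = 0;
    while capacity < total {
        level *= 10; // L1 = 100 MB, L2 = 1 GB, ..., L5 = 1 TB
        capacity += level;
        levels += 1;
    }
    assert_eq!(levels, 5);
    // 10 overlapping L0 SSTables + 1 IO per leveled level.
    println!("worst case: {} IOs", 10 + levels); // 15 IOs
}
```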
b+tree:
- leaf nodes: 64 bytes (key) + 8 bytes (pointer) = 72 bytes
- non-leaf nodes: 64 bytes + 8 bytes = 72 bytes
- ~56 entries / page (4096 / 72 ≈ 56)
- h = 5
best case:
1. all index nodes (inner and leaf) cached: 0 IOs
2. 1 IO for value read
= 1 IO
worst case:
1. 5 IOs for index traversal
2. 1 IO for value read
= 6 IOs
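(same sanity check for the b+tree numbers:)

```rust
// Back-of-the-envelope check of the b+tree numbers above.
fn main() {
    let page: u64 = 4096;
    let node: u64 = 64 + 8; // 64 B key + 8 B pointer
    let fanout = page / node;
    assert_eq!(fanout, 56); // ~56 entries per page

    // 56^5 ~= 551 M and 56^6 ~= 30.8 B, so a tree over 1 B entries is
    // 5-6 levels deep, depending on how the root and leaves are counted.
    println!("56^5 = {}, 56^6 = {}", fanout.pow(5), fanout.pow(6));
}
```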
I didn't consider that every query also needs to filter on the timestamp because of the optimistic CC, so it makes sense why a b+tree would perform better