Oh is it literally another db table? The codebase is large, but (at least for postgres) it looks like the indexes use something that implements the Persistence trait, which the postgres client does.
yeah, exactly!
we encode values being indexed into byte strings that respect sort order (https://github.com/get-convex/convex-backend/blob/main/crates/value/src/sorting.rs#L1) and then use postgres/mysql/sqlite as a sorted key value store: https://github.com/get-convex/convex-backend/blob/dae54b500b6fc02455a314a33f921dc1b93281ed/crates/postgres/src/lib.rs#L703
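for illustration, here's a minimal sketch of the order-preserving idea (hypothetical code, not the actual encoding in sorting.rs, which covers all Convex value types and escaping):

```rust
// Hypothetical sketch: encode a (u64, string) key so that comparing the
// byte strings lexicographically gives the same order as comparing the
// original tuples. The real encoding lives in crates/value/src/sorting.rs.
fn encode_key(n: u64, s: &str) -> Vec<u8> {
    let mut out = Vec::new();
    // Big-endian integers sort the same way as their raw bytes.
    out.extend_from_slice(&n.to_be_bytes());
    // Terminate the string so a following field can't bleed into it
    // (assumes no interior NULs; a real encoding would escape them).
    out.extend_from_slice(s.as_bytes());
    out.push(0);
    out
}

fn main() {
    let a = encode_key(1, "a");
    let b = encode_key(1, "ab");
    let c = encode_key(2, "a");
    assert!(a < b && b < c); // byte order == tuple order
}
```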
what's the reason for using relational databases instead of kv stores for the storage backend? if everything is a byte string, and the source of truth is a LOG, wouldn't using an LSM-backed storage engine be more efficient than a b+tree-backed one?
yeah it's a good question!
honestly, it's just convenience -- we made a simple `Persistence` trait that could be implemented by either an LSM or a BTree, and we use postgres since managed services like RDS are widely available.
the LSM vs. BTree decision is pretty interesting. one peculiarity of our workload for querying indexes is that we need to support multiversioning. so, we want to be able to query a snapshot of the index at slightly stale timestamps.
one way to implement this with a regular sorted map is by storing (key, ts, value) and using a "loose index scan" to find the most recent revision per key that's less than or equal to the snapshot timestamp.
this then shifts the workload towards more range scans, which are (often) better on btrees.
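for example, sketched with an in-memory BTreeMap standing in for the sorted store (hypothetical names, not the actual code):

```rust
use std::collections::BTreeMap;

// Hypothetical sketch: a multiversion index stored as a sorted map over
// (key, ts). A snapshot read is the newest revision with ts <= snapshot_ts.
fn snapshot_read<'a>(
    index: &'a BTreeMap<(String, u64), String>,
    key: &str,
    snapshot_ts: u64,
) -> Option<&'a String> {
    index
        .range((key.to_string(), 0)..=(key.to_string(), snapshot_ts))
        .next_back() // last revision at or before the snapshot
        .map(|(_, v)| v)
}

fn main() {
    let mut index = BTreeMap::new();
    index.insert(("doc1".to_string(), 10), "v1".to_string());
    index.insert(("doc1".to_string(), 20), "v2".to_string());
    assert_eq!(snapshot_read(&index, "doc1", 15), Some(&"v1".to_string()));
    assert_eq!(snapshot_read(&index, "doc1", 25), Some(&"v2".to_string()));
    assert_eq!(snapshot_read(&index, "doc1", 5), None); // before first write
}
```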
but, all of this is to say we just chose it out of convenience and want to maintain a strong abstraction at `Persistence` so we can swap it out later.

thanks for answering my questions. I find this stuff super interesting and I'm currently doing database research at uni.
> we want to be able to query a snapshot of the index at slightly stale timestamps. this then shifts the workload towards more range scans, which are (often) better on btrees.

for the LSM case, you could have a key of `key;<MAX_TS - t>` so that keys are sorted most recent first
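rough sketch of that layout (hypothetical helper; assumes big-endian encoding and that the separator never appears inside keys):

```rust
// Hypothetical sketch of the key;<MAX_TS - t> layout: for a fixed key,
// newer revisions sort first, so the snapshot read becomes a single
// forward seek to the first entry >= lsm_key(key, snapshot_ts).
fn lsm_key(key: &[u8], ts: u64) -> Vec<u8> {
    let mut out = key.to_vec();
    out.push(b';'); // separator; assumes it never appears inside keys
    out.extend_from_slice(&(u64::MAX - ts).to_be_bytes());
    out
}

fn main() {
    // ts = 20 encodes *smaller* than ts = 10, i.e. most recent first.
    assert!(lsm_key(b"doc1", 20) < lsm_key(b"doc1", 10));
}
```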
I did some rough IO estimates for `FIND document WHERE id = <id> AND timestamp <= t` (please correct me if I got anything wrong). it looks like they have similar best-case scenarios, but the LSM struggles when L0 is tiered. if it were leveled, I think the IO numbers could be pretty similar.
key size: 64 bytes
value size: 448 bytes
entry size: 512 bytes
OS page size: 4096 bytes
1 billion entries = 512 GB total storage
query: read document/row at timestamp
lsm (e.g. RocksDB):
- size ratio (T) = 10
- L0 tiered, L1.. leveled
- L0 = 10 MB
- entries per page = 4096 bytes / 512 bytes = 16
- 1 MB skiplist as the memory buffer
- bloom filters with 1% false positive rate
- fence pointers on each level/SSTable
best case
1. search memory buffer(s): 0 IOs
2. L0 search: 1 IO
= 1 IO
worst case
1. search memory buffer(s): 0 IOs
2. L0: 1 IO for each of the 10 SSTables (they have overlapping key ranges) = 10 IOs
3. L1-L5: 1 IO per level (non-overlapping key ranges) = 5 IOs, fence pointers cached in memory
= 15 IOs
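(sanity-checking the level count under those assumptions -- hypothetical back-of-the-envelope code:)

```rust
// Back-of-the-envelope check of the LSM worst case above: with L0 = 10 MB
// and size ratio T = 10, five leveled levels (L1..L5) cover 512 GB.
fn main() {
    let total: u64 = 512 * 1024 * 1024 * 1024; // 1 B entries * 512 B each
    let mut level: u64 = 10 * 1024 * 1024; // L0 = 10 MB
    let mut levels = 0;
    let mut capacity: u64 = 0;
    while capacity < total {
        level *= 10; // L1 = 100 MB, L2 = 1 GB, ..., L5 = 1 TB
        capacity += level;
        levels += 1;
    }
    assert_eq!(levels, 5);
    // 10 overlapping L0 SSTables + 1 IO per leveled level.
    println!("worst case: {} IOs", 10 + levels); // 15 IOs
}
```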
b+tree:
- leaf nodes: 64 bytes (key) + 8 bytes (pointer) = 72 bytes
- non-leaf nodes: 64 bytes + 8 bytes = 72 bytes
- ~56 entries / page (4096 / 72 ≈ 56)
- h = 5
best case:
1. all index nodes (inner and leaf) cached: 0 IOs
2. 1 IO for value read
= 1 IO
worst case:
1. 5 IOs for index traversal
2. 1 IO for value read
= 6 IOs
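(same sanity check for the b+tree numbers:)

```rust
// Back-of-the-envelope check of the b+tree numbers above.
fn main() {
    let page: u64 = 4096;
    let node: u64 = 64 + 8; // 64 B key + 8 B pointer
    let fanout = page / node;
    assert_eq!(fanout, 56); // ~56 entries per page

    // 56^5 ~= 551 M and 56^6 ~= 30.8 B, so a tree over 1 B entries is
    // 5-6 levels deep, depending on how the root and leaves are counted.
    println!("56^5 = {}, 56^6 = {}", fanout.pow(5), fanout.pow(6));
}
```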
I didn't consider that every query also needs to filter on the timestamp because of the optimistic CC, so it makes sense why a b+tree would perform better