Convex Community
Created by ott on 4/17/2025 in #general
Oh is it literally another db table? The
I didn't consider that every query also needs to be keyed on the timestamp because of the optimistic concurrency control, so it makes sense that a b+tree would perform better
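roughly what I mean, as a tiny sketch (assuming a sorted multiversion index keyed on (id, ts); not Convex's actual code):

```python
import bisect
from collections import defaultdict

# toy multiversion index: doc id -> [(ts, value), ...] kept sorted by timestamp
versions = defaultdict(list)

def write(doc_id, ts, value):
    bisect.insort(versions[doc_id], (ts, value))

def read(doc_id, snapshot_ts):
    """Every point read is really 'latest version of doc_id with ts <= snapshot_ts'."""
    vs = versions[doc_id]
    # rightmost version at or before the snapshot timestamp (Python 3.10+ key=)
    i = bisect.bisect_right(vs, snapshot_ts, key=lambda v: v[0])
    return vs[i - 1][1] if i else None

write("user:42", 990, "v1")
write("user:42", 995, "v2")
write("user:42", 999, "v3")
print(read("user:42", 996))  # -> 'v2' (the version written at ts=995)
print(read("user:42", 989))  # -> None (nothing existed yet at that snapshot)
```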
I did some rough IO estimates for `FIND document WHERE id = <id> AND timestamp <= t` (please correct me if I got anything wrong). it looks like they have similar best-case scenarios, but the LSM struggles when L0 is tiered. if it were leveled, I think the IO numbers could be pretty similar.

assumptions:
- key size: 64 bytes
- value size: 448 bytes
- entry size: 512 bytes
- OS page size: 4096 bytes
- 1 billion entries = 512 GB total storage
- query: read document/row at timestamp

lsm (e.g. rocksdb):
- size ratio (T) = 10
- L0 tiered, L1+ leveled
- L0 = 10 MB
- entries per page = 4096 bytes / 512 bytes = 8
- 1 MB skiplist as the memory buffer
- bloom filters with 1% false positive rate
- fence pointers on each level/SSTable
| Level | size | num Entries | num SSTables |
| ----- | ------:| -----------:| ------------:|
| 0 | 10 MB | 20k | 1-10 (var) |
| 1 | 100 MB | 200k | 10 |
| 2 | 1 GB | 2m | 100 |
| 3 | 10 GB | 20m | 1k |
| 4 | 100 GB | 200m | 10k |
| 5 | 400 GB | 777m | 40k |
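quick sanity check on the table above, just redoing the arithmetic from those parameters (GB vs GiB rounding makes the numbers drift a little):

```python
# redo the level-size arithmetic from the parameters above
ENTRY = 512                       # bytes: 64 B key + 448 B value
TOTAL = 1_000_000_000             # entries
T = 10                            # size ratio between levels
L0_BYTES = 10 * 1024**2           # 10 MB
SSTABLE_BYTES = 10 * 1024**2      # assume ~10 MB per SSTable

remaining = TOTAL
for level in range(6):
    capacity = (L0_BYTES * T**level) // ENTRY    # entries the level can hold
    entries = min(capacity, remaining)           # the last level takes the leftovers
    remaining -= entries
    sstables = max(1, entries * ENTRY // SSTABLE_BYTES)
    print(f"L{level}: {entries * ENTRY / 1024**3:7.2f} GiB  "
          f"{entries:>12,} entries  ~{sstables:,} SSTables")
# L5 comes out around 770M entries / ~370 GiB, i.e. the ~777m / 400 GB row above
# (L0 is tiered, so in practice it holds 1-10 overlapping runs rather than one)
```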
lsm IO counts:

best case:
1. search memory buffer(s): 0 IOs
2. L0 search: 1 IO
= 1 IO

worst case:
1. search memory buffer(s): 0 IOs
2. L0: 1 IO for each of the 10 SSTables (they have overlapping key ranges) = 10 IOs
3. L1-L5: 1 IO per level (non-overlapping key ranges, fence pointers cached in memory) = 5 IOs
= 15 IOs

b+tree:
- leaf node entries: 64 bytes (key) + 8 bytes (pointer) = 72 bytes
- non-leaf node entries: 64 bytes (key) + 8 bytes (pointer) = 72 bytes
- ~56 entries / page
- h = 5

best case:
1. all leaf nodes cached
2. 1 IO for value read
= 1 IO

worst case:
1. 5 IOs for index traversal
2. 1 IO for value read
= 6 IOs
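same numbers as a quick script (the bloom-filter expected-cost line is an extra estimate of mine on top of the worst case above):

```python
import math

# rough IO tallies for "read document at timestamp", same parameters as above
PAGE, N = 4096, 1_000_000_000

# LSM: up to 10 overlapping L0 runs plus 5 leveled levels, fence pointers in memory
l0_runs, leveled_levels = 10, 5
lsm_best = 1                                   # found in the first L0 run probed
lsm_worst = l0_runs + leveled_levels           # probe every run and level = 15 IOs
# extra estimate: 1% bloom false positives skip almost every non-matching probe
lsm_expected = 1 + 0.01 * (l0_runs + leveled_levels - 1)   # ~1.14 IOs

# B+tree: 72 B index entries -> ~56 per 4 KiB page
fanout = PAGE // 72                            # ~= 56
nodes, levels = N, 0
while nodes > 1:                               # index levels from the leaves up to the root
    nodes = math.ceil(nodes / fanout)
    levels += 1                                # -> 6 levels (incl. leaves) for 1B entries
btree_best = 1                                 # index fully cached: just the value read
btree_worst = (levels - 1) + 1                 # root pinned in cache: 5 index IOs + value = 6

print(lsm_best, lsm_worst, round(lsm_expected, 2))
print(fanout, levels, btree_best, btree_worst)
```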
thanks for answering my questions. I find this stuff super interesting and I'm currently doing database research at uni.
we want to be able to query a snapshot of the index at slightly stale timestamps. this then shifts the workload towards more range scans, which are (often) better on btrees.
for the LSM case, you could use a key of `key;<MAX_TS - t>` so that versions of the same key sort most recent first, e.g.:
user:42;990 -> t=9
user:42;995 -> t=4
user:42;999 -> t=0
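a tiny sketch of that encoding, using MAX_TS = 999 just to match the example (a real key would use a fixed-width binary timestamp):

```python
MAX_TS = 999   # stand-in for the max representable timestamp

def version_key(doc_key: str, ts: int) -> str:
    # store MAX_TS - ts (zero-padded) so newer versions sort first within a key
    return f"{doc_key};{MAX_TS - ts:03d}"

def latest_at(store: dict, doc_key: str, snapshot_ts: int):
    """The first key >= doc_key;(MAX_TS - snapshot_ts) is the newest version at or before snapshot_ts."""
    seek = version_key(doc_key, snapshot_ts)
    for k in sorted(store):                   # an LSM iterator would seek() instead of scanning
        if k >= seek and k.startswith(doc_key + ";"):
            return store[k]
    return None

store = {
    version_key("user:42", 0): "v@0",         # -> user:42;999
    version_key("user:42", 4): "v@4",         # -> user:42;995
    version_key("user:42", 9): "v@9",         # -> user:42;990
}
print(latest_at(store, "user:42", 5))         # -> 'v@4' (newest version with t <= 5)
```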
what's the reason for using relational databases instead of kv stores for the storage backend? if everything is a byte string, and the source of truth is a LOG, wouldn't an LSM-backed storage engine be more efficient than a b+tree-backed one?