Which identifier generator library to use for message channels and threads?
1) I am building a Discord-like messaging experience.
For servers, channels, threads, etc., it appears that Discord uses Snowflake IDs, which are 18-digit numeric values.
What would you recommend for a Convex project? It seems that for a single Convex project, having IDs that are unique across distributed database nodes is not important, but sortability and query performance are.
If I wanted to keep my programming effort really simple, I could use the _id field values themselves, but 32 characters is a lot to put in URLs, especially when multiple resources are involved.
2) Likewise, what would you recommend for a user profile external identifier? It looks like LinkedIn appends an 8-character suffix made of lowercase letters and digits.
9 Replies
following this.
i am building a crm and ids are crucial part in tracking down leads and rn i'm using _id but its too long.
i was thinking about setting up an internal mutation to count the number of records and +1 every time, and also give the system admin the ability to choose where to start. for example if they write 3000 then it'll go from 3000 +1 +2 +3
i use to do this like 10 years ago with php.
but i am not sure with how efficient it'd be for convex or if there's any better way where ids can be 4-5 numbers long max.
con: scalability issues of running out of numbers
My tentative solution is to use CUID2, set the length to say, 8, until some identifier collides in production, and then increase the length
But I’m wondering if there’s something about Convex db that would make a different generator better. For example, maybe querying on number type is just way faster
Or maybe there’s something I can’t readily anticipate, eg sort-ability matters a lot
I like the oslo API:
For strings: https://oslo.js.org/reference/crypto/generateRandomString (You give it length and alphabet.)
For unsigned integers: https://oslo.js.org/reference/crypto/generateRandomInteger (you give it max)
I would determine what entropy you need, based on number of items and desired collision rate.
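To make the entropy estimate concrete, here is a minimal sketch using the standard birthday-bound approximation. The function names are made up for illustration and do not come from Oslo or any other library.

```typescript
// Bits of entropy in a random ID of `length` symbols drawn uniformly
// from an alphabet of `alphabetSize` symbols.
function entropyBits(alphabetSize: number, length: number): number {
  return length * Math.log2(alphabetSize);
}

// Approximate probability that at least two of `itemCount` random IDs collide.
// Birthday bound: p ≈ 1 - exp(-n(n-1) / (2N)), where N = alphabetSize ** length.
function collisionProbability(
  itemCount: number,
  alphabetSize: number,
  length: number,
): number {
  const space = Math.pow(alphabetSize, length);
  // expm1 keeps precision when the probability is tiny.
  return -Math.expm1(-(itemCount * (itemCount - 1)) / (2 * space));
}

// Example: 8 characters of lowercase letters and digits (36 symbols).
console.log(entropyBits(36, 8).toFixed(1), "bits");
console.log(collisionProbability(1_000_000, 36, 8));
```

Under this approximation, 8 characters from a 36-symbol alphabet (about 41 bits) give roughly a 16% chance of at least one collision by the time you reach one million IDs, so either a collision check or a longer ID is worth having at that scale.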
Then I would use:
number column if you're within MAX_SAFE_INTEGER and don't care about future compat
int64 if you're above or not sure if you'll need to expand in the future
string if you use strings
I think that "random + check for collision" will perform better than "incrementing by one", which will lead to more contention (and hence potential OCC errors).
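A rough sketch of the "random + check for collision" loop, kept independent of any database. `generate` and `exists` are hypothetical callbacks: in a Convex mutation, `exists` would presumably be an indexed lookup on the short ID column, and the check plus the insert would need to happen in the same transaction for the guarantee to hold.

```typescript
// Hypothetical callback: resolves to true if the candidate ID is already taken.
type Exists = (id: string) => Promise<boolean>;

// Draw random candidates until one is free, giving up after maxAttempts.
async function generateUniqueId(
  generate: () => string,
  exists: Exists,
  maxAttempts = 5,
): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const candidate = generate();
    if (!(await exists(candidate))) {
      return candidate;
    }
  }
  // Repeated collisions suggest the ID space is getting crowded.
  throw new Error(
    `No free ID found after ${maxAttempts} attempts; consider a longer ID`,
  );
}
```

Unlike a shared counter, two concurrent callers here only touch the same value when they happen to draw the same random ID, which is the reason to expect less contention.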
Use an index on the column. The data type doesn't meaningfully affect how fast the lookup will be. It will affect the storage size.
Thanks Michal.
1) I can't find where in Discord Convex staff talked about how the primary key values are generated. But given that there are various pieces of logic that compose the 32-character ID, does it make any sense to use part of the primary key for this use case? I'm guessing it does not make sense.
2) Can you say more about the storage size, database bandwidth, etc. implications? I imagine using the number data type (and indexing that column) uses significantly less disk space.
3) Regarding forwards compatibility, it seems that the number data type cannot safely store snowflake IDs. I don't think my app will ever need to accommodate 18-digit snowflake numbers, but it seems that if you don't use a snowflake-like number, then you don't have sortability. I can't readily think of a reason I would need sortability, but would you recommend having a sortable ID for my use case?
4) I checked out Oslo. It seems to be led by the creator of Lucia Auth. Why do you like Oslo? (as opposed to, say, CUID2 or nanoid)
"I'm guessing it does not make sense."
It does not, the format is subject to change.
"I imagine using the number data type (and indexing that column) uses significantly less disk space."
float64 and int64 use 8 bytes. Strings are stored as UTF-8, so 1-4 bytes per character (1 byte each for ASCII letters and digits).
"would recommend having a sort-able ID for my use case"
I'm not sure what sortability means here. Integers and strings are sortable (orderable).
"(as opposed to say, CUID2, or nanoId)"
I didn't test these. You might prefer them. I have learned a lot from Lucia Auth and related resources.
Ah, some of these ID generator libraries have sortable by created time as a feature.
For example, sorting snowflake IDs also sorts them by created time.
Snowflakes are sortable by time, because they are based on the time they were created
https://en.wikipedia.org/wiki/Snowflake_ID
I think CUID2 did at one point in history, but seems not to anymore:
https://github.com/paralleldrive/cuid2#note-on-k-sortablesequentialmonotonically-increasing-ids
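To illustrate why snowflakes sort by creation time, here is a small sketch that decodes the timestamp from a Discord-style snowflake. It assumes the layout Discord documents (milliseconds since 2015-01-01 UTC stored above the low 22 bits); the function name is made up for this example.

```typescript
// Discord's epoch: 2015-01-01T00:00:00.000Z, in Unix milliseconds.
const DISCORD_EPOCH_MS = 1420070400000n;

// The low 22 bits hold worker, process, and sequence fields;
// everything above them is milliseconds since the Discord epoch.
function snowflakeTimestampMs(id: bigint): number {
  return Number((id >> 22n) + DISCORD_EPOCH_MS);
}

const example = 175928847299117063n;
console.log(new Date(snowflakeTimestampMs(example)).toISOString());

// An 18-digit snowflake does not fit in a JS number without losing precision,
// which is why it needs int64/bigint or a string.
console.log(example > BigInt(Number.MAX_SAFE_INTEGER)); // true
```

Because the timestamp occupies the most significant bits, comparing two snowflakes numerically also compares their creation times. The cost is that those bits are predictable, so they add length without adding randomness.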
I'll invoke @sujayakar our Chief ID Officer on this question.
The idaddy if you will
chief ID officer reporting for duty! yeah, any of these are fine (nanoid, cuid2, generateRandomString). if you don't need sortability by creation time, I'd recommend avoiding it since it'll make the identifiers have less entropy (and therefore need to be longer).
so, if the goal is to have a short, human visible ID, the main questions are...
1. how many bits of randomness do you need? the fewer bits the shorter the string, but the IDs become more guessable/enumerable and likely to collide. you can also make this variable where you start adding more bits as the tables get larger.
2. which alphabet do you use for encoding the random bits? the larger the alphabet, the shorter the string. it's generally worth being URL safe, avoiding ambiguous characters (e.g. O vs 0), and making it impossible to accidentally generate swear words (lol)
you can play with these parameters (and get a code snippet for nanoid) with this calculator: https://zelark.github.io/nano-id-cc/
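As a sketch of how those two parameters interact, the snippet below picks a length from a target number of bits and generates an ID from a custom alphabet. The alphabet is an assumption for illustration (a lowercase Crockford-style base32 that drops i, l, o, and u), and the code assumes a runtime that exposes the Web Crypto `crypto.getRandomValues` global. The helper names are hypothetical and not taken from nanoid, cuid2, or Oslo.

```typescript
// Assumed alphabet: digits plus lowercase letters without i, l, o, u.
// Dropping i/l/o avoids look-alikes of 1 and 0; dropping u makes some
// accidental words harder to form. It has exactly 32 symbols (5 bits each).
const ALPHABET = "0123456789abcdefghjkmnpqrstvwxyz";

// Number of symbols needed to carry at least `bits` bits of randomness.
function lengthForBits(bits: number, alphabetSize: number): number {
  return Math.ceil(bits / Math.log2(alphabetSize));
}

// Generate a random ID of `length` symbols from ALPHABET.
function randomId(length: number): string {
  const bytes = crypto.getRandomValues(new Uint8Array(length));
  let id = "";
  for (const byte of bytes) {
    // 256 is a multiple of 32, so masking to 5 bits introduces no bias.
    id += ALPHABET.charAt(byte & 31);
  }
  return id;
}

console.log(lengthForBits(41, 32)); // 9
console.log(randomId(lengthForBits(41, 32)));
```

The masking trick only works because the alphabet size is a power of two. For other alphabet sizes, a library such as nanoid or Oslo's generateRandomString is the safer choice, since they handle the bias for you.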