Mugi · 2w ago

Qwisine14B: FINAL Evaluation of a Fine-Tuned Model on Convex

Qwisine: Early Evaluation of a Fine-Tuned Model on Convex

TL;DR
Qwisine is a fine-tuned Qwen3-14B model trained on Convex documentation and synthetic reasoning data. After a first round of training (~700k tokens), it performs on par with GPT-4.1 across key Convex development tasks, beating models like DeepSeek R1, Grok 3 Mini (Beta), GPT-4o, etc. More practical data and larger model evaluations are planned.

Dataset & Training
The initial dataset was built from the Convex documentation, forming question–answer pairs across types like "what," "why," "how to," and "task." Each pair was matched with relevant context chunks. Synthetic reasoning data was later added using Claude 3.7 (thinking) to improve depth. This first phase used ~700k tokens.

Evaluation – Phase 1
* Model: Qwisine (Qwen3-14B fine-tune)
* Evaluation scope: Core Convex categories
* Average score: 65.47%

Next Steps
* Add ~500k tokens of real-world Convex code and tasks
* Incorporate more practical, implementation-based examples
* Fine-tune and evaluate a larger 32B model

Why Convex?
I chose Convex simply because a friend uses it and talks about it all the time (ultimate convex glazer @v), and it seemed like a good opportunity to learn something new.

PS: I'll share the Hugging Face and dataset links later — they're still rough, and since I'm new to this and learning as I go, I'm a bit embarrassed to put them out just yet. But they're coming!
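(Not part of the original post: a minimal sketch of what one entry in a docs-derived QA dataset like this could look like. The field names and values are hypothetical, not the actual Qwisine schema.)

```python
# Hypothetical shape of one docs-derived training example (field names assumed,
# not taken from the actual Qwisine dataset).
example = {
    "question_type": "how to",   # one of: "what" / "why" / "how to" / "task"
    "question": "How do I define a mutation that inserts a document in Convex?",
    "context": [                 # relevant chunks pulled from the Convex docs
        "Mutations are defined with the mutation() wrapper and can write to the database...",
    ],
    "reasoning": "The user wants to write data, so a mutation (not a query) is needed...",
    "answer": "Define it in convex/*.ts using the mutation() wrapper and ctx.db.insert(...).",
}
```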
Mugi (OP) · 2w ago
Evaluation was done using the https://github.com/get-convex/convex-evals repo. What I noticed during testing/evaluation is that the score isn't totally accurate: in a lot of tests the model simply forgot small details, like package.json generation, or made a small typo, which cost it a perfect score.
v · 2w ago
This is fire, now we need components evals
AI ahhhh writing
Someone loan a GPU farm?
james · 2w ago
very cool! also "ultimate convex glazer" is a great title 😆 i'm sure @jordan has more opinions but yes the evals repo isn't perfect. in particular i think that just because a model does best in the evals doesn't mean it's necessarily ideal for Chef, where promptability is often more important than scoring
Wayne · 2w ago
@v just collected a new role!
v · 2w ago
I'm honored 😌
mikeysee · 2w ago
Oh wow this is awesome @Mugi !! We also noticed that AI sometimes "forgot small details like package.json generation" when we were running the evals. I assumed this was because we were perhaps pushing the model to do a little too much in the task and it was reaching its limit of instruction following. For me this showed up when we were trying to do things like "triply nested joins" or something like that. What sort of datasets are you looking for? We might be able to help with that: Stack posts, Discord content, #ask-ai questions?
Mugi (OP) · 2w ago
My next plan: @v, send me links to the Stack posts and also some demo repos. I'm planning to create the second part of the dataset from real-world Convex code examples, by taking code from the repos and reversing the input, i.e. generating the possible user questions that would lead to that answer (the code from the repos). That should further enhance the model's usefulness beyond the documentation and toward real-world usage. @mikeysee the ask-ai thread questions would be a gold mine; you can only get so far on synthetic input, so that would really help. Tbh the end goal is not a model that is perfect at answering / helping with Convex questions on its own; instead, it's combining the fine-tuned model, the Convex MCP, and RAG to achieve the most accurate results.
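(Not from the thread: a minimal sketch of the "reverse the input" idea under stated assumptions, where an LLM is asked to invent the user question that a given repo snippet answers. The generate_question helper, prompt wording, and model choice are hypothetical.)

```python
# Hypothetical sketch: turn real repo code into (question, answer) training pairs
# by asking an LLM to invent the user question the snippet answers.
from openai import OpenAI  # any chat-completion client would work here

client = OpenAI()

def generate_question(code_snippet: str) -> str:
    # Prompt wording is illustrative, not the dataset's actual prompt.
    resp = client.chat.completions.create(
        model="gpt-4.1",  # placeholder model choice
        messages=[
            {"role": "system", "content": "You write realistic developer questions."},
            {"role": "user", "content": "Write the question a Convex developer would ask "
                                         f"that this code answers:\n\n{code_snippet}"},
        ],
    )
    return resp.choices[0].message.content

def make_pair(code_snippet: str) -> dict:
    # One synthetic training pair: generated question, real code as the answer.
    return {"question": generate_question(code_snippet), "answer": code_snippet}
```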
mikeysee · 2w ago
I was just looking at our Kapa.ai dataset and there is no export functionality. I'm not sure how useful that dataset would be anyway, as it seems like a lot of the questions from people on the Convex docs are actually people super confused about what Chef is and what the AI Docs Bot is. So we could limit it to the #ask-ai channel, in which case we would need a way to export the data from that channel via Discord. I might be able to write a script that hits the Discord API and exports the data. I'll have a think.
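(Not from the thread: a rough sketch of what such an export script could look like, assuming a bot token with read access to the channel. The GET /channels/{id}/messages endpoint is a real Discord REST route, but the token, channel ID, and output format below are placeholders.)

```python
# Rough sketch: page backwards through a channel's history via the Discord REST API
# and dump messages to JSONL. Assumes a bot token with permission to read the channel.
import json
import requests

TOKEN = "YOUR_BOT_TOKEN"           # placeholder
CHANNEL_ID = "123456789012345678"  # placeholder channel ID
HEADERS = {"Authorization": f"Bot {TOKEN}"}

def export_channel(path: str) -> None:
    before = None
    with open(path, "w") as out:
        while True:
            params = {"limit": 100}
            if before:
                params["before"] = before
            resp = requests.get(
                f"https://discord.com/api/v10/channels/{CHANNEL_ID}/messages",
                headers=HEADERS, params=params, timeout=30,
            )
            resp.raise_for_status()
            batch = resp.json()
            if not batch:
                break  # reached the start of the channel
            for msg in batch:
                out.write(json.dumps({"author": msg["author"]["username"],
                                      "content": msg["content"]}) + "\n")
            before = batch[-1]["id"]  # oldest message in this page

export_channel("ask_ai_export.jsonl")
```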
mikeysee · 2w ago
I have just been informed that we have this site: https://discord-questions.convex.dev/ feel free to scrape that if it helps 🙂
Mugi (OP) · 2w ago
Oh!!! It's perfect, thank you. Well, today I will try the second phase, which in theory should be much, much better for real-world tasks and correct syntax. Hoping to do evaluations + build an app with the help of this iteration of the model and compare it to Chef in terms of tool calling, instruction following, etc.

New size of dataset: old ~700k tokens -> new ~3mil tokens

I will run some post-processing tasks before fine-tuning (a rough sketch follows below). This new dataset includes more practical examples, edge cases, real-world usages, and more complex inputs with code snippets simulating real user questions & tasks.

Preliminary score, Phase 2 (~3mil tokens): 70.37% (second try of the eval round, due to an error).

I had to stop training on the third epoch after noticing overfitting, so I'll have to re-run and experiment. The only category that didn't increase is data modelling: it actually dropped from 72% in phase 1 to 66%; every other metric increased noticeably. Though I'm hoping for an even better result if I manage to finish all epochs. Proud of top 4; if I manage to finish all epochs it will probably reach top 3 and take DeepSeek V3's place. Also tested it in-editor (JetBrains + LM Studio) directly with tool calling, and saw no degradation in tool calling, which is fire! ❤️‍🔥
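(Not from the thread: a minimal sketch of the kind of post-processing pass mentioned above. The exact steps are assumptions: dedup, length filtering, and dropping malformed records from a JSONL dataset.)

```python
# Minimal, assumed post-processing pass over a JSONL dataset before fine-tuning:
# drop exact duplicates, trivially short answers, and malformed records.
import hashlib
import json

def postprocess(in_path: str, out_path: str, min_answer_chars: int = 40) -> None:
    seen = set()
    kept = 0
    with open(in_path) as f_in, open(out_path, "w") as f_out:
        for line in f_in:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # drop malformed rows
            if not rec.get("question") or len(rec.get("answer", "")) < min_answer_chars:
                continue  # drop empty questions / trivially short answers
            digest = hashlib.sha256((rec["question"] + rec["answer"]).encode()).hexdigest()
            if digest in seen:
                continue  # drop exact duplicates
            seen.add(digest)
            f_out.write(json.dumps(rec) + "\n")
            kept += 1
    print(f"kept {kept} records")

postprocess("convex_phase2_raw.jsonl", "convex_phase2_clean.jsonl")
```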
Mugi (OP) · 2w ago
TLDR (image attachment)
Mugi (OP) · 2w ago
I think after I hopefully finish phase 2, phase 3 will be the last. Hoping for second place, and then I can sunset this hobby project.
mikeysee · 7d ago
Wow nice!! That's awesome work, really good to know you can get super far with this. I can only imagine how good we could get if we were able to fine-tune Sonnet 3.5.
radix · 5d ago
I just came across this while browsing the Discord. If you haven't already, you could use Unsloth to cut down on the training requirements (almost by half). It also has Ollama and vLLM exporters, so the model you train could be used directly by most people (if you open-source it) or used for Chef easily.
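(Not from the thread: a minimal sketch of an Unsloth LoRA setup along those lines. The base checkpoint name, sequence length, and LoRA hyperparameters are placeholder assumptions, not Qwisine's actual training config.)

```python
# Minimal, assumed Unsloth QLoRA setup (placeholder hyperparameters, not Qwisine's config).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-14B",  # assumed base checkpoint name
    max_seq_length=4096,
    load_in_4bit=True,               # 4-bit loading to cut VRAM use
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# ...train with your usual trainer (e.g. TRL's SFTTrainer), then export to GGUF:
# model.save_pretrained_gguf("qwisine-gguf", tokenizer, quantization_method="q4_k_m")
```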
Mugi (OP) · 3d ago
Yeah, I will open-source it today! And link the Hugging Face dataset, new evaluation results, and GGUF model. I used Axolotl initially, then Unsloth. Thank you!
Plus, I'm not sure if Convex allows Chef to use your own / local model, since it ain't open source.
This model, imho, after the new evaluation, is more of a sidekick during Convex development: to draft / fix / create toward one task at a time.
I will also be releasing in the coming days a RAG-ready chunked dataset that has been enhanced, cleaned up, and structured for RAG, so using that + Qwisine will likely achieve top performance for all purposes.
https://huggingface.co/mugivara1/Qwisine
alright bois and girls
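(Not from the thread: a minimal sketch of the "chunked dataset + Qwisine" RAG idea under stated assumptions: embed the chunks, retrieve the closest ones for a question, and prepend them to the prompt given to the fine-tuned model. The embedding model choice and prompt format are placeholders.)

```python
# Minimal, assumed RAG sketch: retrieve the closest chunks for a question and
# prepend them to the prompt sent to the fine-tuned model.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

chunks = [
    "Mutations are defined with the mutation() wrapper...",
    "Queries are reactive and cached...",
]  # stands in for the RAG-ready chunked Convex dataset
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def build_prompt(question: str, top_k: int = 3) -> str:
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec                   # cosine similarity (vectors normalized)
    best = np.argsort(scores)[::-1][:top_k]
    context = "\n\n".join(chunks[i] for i in best)
    return f"Use the following Convex docs context:\n{context}\n\nQuestion: {question}"

# The resulting prompt would then be sent to the Qwisine model (e.g. via LM Studio).
```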
Mugi (OP) · 3d ago
Final score: 73.03%. If I had hardware to run 32B, I'd bet my house it would be top 1/2 instead of top 3.
Training sets & validation set:
https://huggingface.co/datasets/mugivara1/convex-reasoning-new-train
https://huggingface.co/datasets/mugivara1/convex-reasoning-validation
https://huggingface.co/datasets/mugivara1/convex-reasoning
Mugi (OP) · 3d ago
If only I had the hardware to run 32B (I kinda can, but it's a bit of a hassle), I'm confident it would score first / second. 14B is the most realistic to run on consumer hardware with at least 16GB VRAM.
mikeysee · 19h ago
Incredible work! Amazing that you can get so far with this. I'm going to have to do a video on this I think 🙂 Hopefully I'm able to run this model locally on my poor 3070!
v · 15h ago
Lol, I'm not sure you can, but good luck! If I could let you borrow my 7900 XT I would; it's not great for AI but I think it would work. Actually, we might be able to make that happen, @Mugi thoughts? Well, not right now since I don't have Internet, but you know, once I get it.
Mugi (OP) · 15h ago
I can maybe put it on some hosting service for the video, but I will have to take it down after that. Or I could fine-tune an 8B version instead; it would be dumber but manageable on a 3070.
v · 14h ago
I was thinking tailscale
radix · 12h ago
If you quantize with Ollama I think you can make it fit. Q4_K would bring it down to ~7 GB with acceptable performance, Q3_0 to ~5.6 GB, but quality would be a bit lower.
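(Not from the thread: a quick back-of-the-envelope helper behind size estimates like these, assuming size ≈ parameter count × bits per weight. It ignores GGUF metadata overhead and the mixed bit widths inside K-quants, so real files differ somewhat.)

```python
# Rough GGUF size estimate: parameters * bits-per-weight / 8, ignoring metadata
# overhead and the mixed bit widths used inside K-quants.
def approx_gguf_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(approx_gguf_gb(14, 4.5))  # ~7.9 GB for a 14B model at ~4.5 bits (Q4_K-ish)
print(approx_gguf_gb(14, 3.4))  # ~6.0 GB at ~3.4 bits (Q3-ish)
```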
Mugi (OP) · 12h ago
Yeah, but remember context size if he wants to try thinking / reasoning mode. One decent request could be up to ~8k tokens.
v · 12h ago
I think my brain was quantized to the max
Mugi (OP) · 12h ago
That's not a good thing
v · 12h ago
That's the point I ain't got no context left
Mugi (OP) · 12h ago
Oh self glazing again
v · 12h ago
The opposite but we all know I'm a genius so no point in bringing it up
Mugi (OP) · 12h ago
Well, if Mikey wants, I'll help him set it up: proper parameters and all the nuances.
v · 11h ago
Good boy
My Internet is still gone
My phone provider banned me because I bypassed the data cap
Had to get a T-Mobile eSIM
