Qwisine14B: FINAL Evaluation of a Fine-Tuned Model on Convex
TL;DR
Qwisine is a fine-tuned Qwen3-14B model trained on Convex documentation and synthetic reasoning data. After a first round of training (~700k tokens), it performs on par with GPT-4.1 across key Convex development tasks, beating models like DeepSeek R1, Grok 3 Mini (Beta), and GPT-4o. More practical data and larger model evaluations are planned.
Dataset & Training
The initial dataset was built from the Convex documentation, forming question–answer pairs across types like “what,” “why,” “how to,” and “task.” Each was paired with relevant context chunks.
Later, synthetic reasoning data generated with Claude 3.7 (thinking) was added to improve depth.
This first phase used ~700k tokens.
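To make the shape of the data concrete, here is a minimal sketch of how such pairs could be assembled into JSONL training records. This is illustrative only, not the actual Qwisine pipeline; the file names and record fields are assumptions:

```python
import json

# Illustrative only: the chunk format, question types, and field names
# are assumptions, not the actual Qwisine pipeline.
QUESTION_TYPES = ["what", "why", "how to", "task"]

def make_records(chunks):
    """Yield one training record per (doc chunk, question type) pair."""
    for chunk in chunks:
        for qtype in QUESTION_TYPES:
            yield {
                "type": qtype,
                "context": chunk["text"],               # relevant doc excerpt
                "question": chunk["questions"][qtype],  # generated question
                "answer": chunk["answers"][qtype],      # generated answer
            }

with open("convex_docs_chunks.json") as f, open("convex_qa.jsonl", "w") as out:
    for record in make_records(json.load(f)):
        out.write(json.dumps(record) + "\n")
```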
Evaluation – Phase 1
* Model: Qwisine (Qwen3-14B fine-tune)
* Evaluation scope: Core Convex categories
* Average score: 65.47%
Next Steps
* Add ~500k tokens of real-world Convex code and tasks
* Incorporate more practical, implementation-based examples
* Fine-tune and evaluate a larger 32B model
Why Convex?
I chose Convex simply because a friend uses it and talks about it all the time (ultimate Convex glazer @v), and it seemed like a good opportunity to learn something new.
PS: I’ll share links to Hugging Face and Datasette later — they’re still rough, and since I’m new to this and learning as I go, I’m a bit embarrassed to put them out just yet. But they’re coming!

evaluation was done using the https://github.com/get-convex/convex-evals repo. what i noticed during testing/evaluation is that the score is not totally accurate: in a lot of tests the model simply forgot small details like package.json generation, or made a small typo (which cost it a perfect score)
This is fire, now we need components evals
Ai ahhhh writing
Someone loan a GPU farm?
very cool!
also "ultimate convex glazer" is a great title 😆
i'm sure @jordan has more opinions but yes the evals repo isn't perfect. in particular i think that just because a model does best in the evals doesn't mean it's necessarily ideal for Chef, where promptability is often more important than scoring
@v just collected a new role!
I'm honored 😌
Oh wow this is awesome @Mugi !!
We also noticed that AI sometimes "forgot small details like package.json generation" when we were running the evals. I assumed this was because we were perhaps pushing the model to do a little too much in the task and it was reaching its limit of instruction following. For me this showed up when we were trying to do things like "triply nested joins" or something like that.
What sort of datasets are you looking for? We might be able to help with that? Stack posts, discord content, #ask-ai questions?
That was my next plan. @v send me links to Stack posts and also some demo repos. I'm planning to create a second part of the dataset composed of real-world Convex code examples: take code from repos and reverse the input, i.e. generate the likely user questions whose answer is that code, to further enhance the model's usefulness beyond documentation and into real-world usage. @mikeysee the #ask-ai thread questions would be a gold mine; you can only get so far on synthetic input, so that would really help
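A rough sketch of that "reverse the input" idea, assuming an OpenAI-compatible client; the prompt wording and model choice are placeholders, not the actual setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """You are building training data for a Convex coding assistant.
Given the following Convex code, write three realistic questions a developer
might ask for which this code is the correct answer. One question per line.

Code:
{code}
"""

def reverse_questions(code: str) -> list[str]:
    """Generate plausible user questions whose answer is `code`."""
    resp = client.chat.completions.create(
        model="gpt-4.1",  # placeholder model choice
        messages=[{"role": "user", "content": PROMPT.format(code=code)}],
    )
    return [q for q in resp.choices[0].message.content.splitlines() if q.strip()]
```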
Tbh the end goal is not a model that is perfect at answering / helping with Convex questions on its own; instead, combining the fine-tuned model, the Convex MCP, and RAG should achieve the most accurate results
I was just looking at our Kapa.ai dataset and there is no export functionality. I'm not sure how useful that dataset would be anyway, as it seems like a lot of the questions from people on the Convex docs are really just people super confused about what Chef is and what the AI Docs Bot is..
So we could limit it to the #ask-ai channel in which case we would need a way to use discord to export the data from that channel.
I might be able to write a script that could hit the Discord API and export the data.. I'll have a think
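Roughly, such a script would page backwards through Discord's message-history endpoint. This is an untested sketch; the bot token and channel ID are placeholders:

```python
import json
import requests

API = "https://discord.com/api/v10"
HEADERS = {"Authorization": "Bot YOUR_BOT_TOKEN"}  # placeholder token
ASK_AI_CHANNEL = "123456789012345678"              # placeholder channel ID

def export_channel(channel_id: str) -> list[dict]:
    """Page backwards through a channel's history, 100 messages at a time."""
    messages, before = [], None
    while True:
        params = {"limit": 100}
        if before:
            params["before"] = before
        batch = requests.get(
            f"{API}/channels/{channel_id}/messages",
            headers=HEADERS, params=params, timeout=30,
        ).json()
        if not batch:
            return messages
        messages.extend(batch)
        before = batch[-1]["id"]  # Discord returns newest first

with open("ask_ai_export.json", "w") as f:
    json.dump(export_channel(ASK_AI_CHANNEL), f, indent=2)
```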

I have just been informed that we have this site: https://discord-questions.convex.dev/ feel free to scrape that if it helps 🙂
oh!!! it's perfect, thank you
well, today i will try the second phase, which in theory should be much, much better for real-world tasks and correct syntax. hoping to do evaluations + build an app with the help of this iteration of the model and compare it to Chef in terms of tool calling, instruction following, etc.
new size of dataset ->
old ~700k tokens, new ~3mil tokens. will run some post-processing tasks before fine-tuning. this new dataset includes more practical examples, edge cases, real-world usage, and more complex inputs with code snippets simulating real user questions & tasks.
Preliminary score, Phase 2 (~3mil tokens): 70.37% (second eval round). due to an error i had to stop training on the third epoch and noticed overfitting, so i'll have to re-run and experiment. the only metric that didn't increase is data modelling: it actually dropped from 72% in phase 1 to 66%; every other metric increased noticeably. tho i'm hoping for an even better result if i manage to finish all epochs. proud of top 4; if i manage to finish all epochs it will prob reach top 3 and take DeepSeek V3's place. also tested it in-editor (JetBrains + LM Studio) directly with tool calling. saw no degradation in tool calling, which is fire! ❤️🔥
TL;DR

I think after I hopefully finish Phase 2, Phase 3 will be the last. Hoping for second place, and then I can sunset this hobby project.
wow nice!!
That's awesome work, really good to know you can get super far with this
I can only imagine how good we could get if we were able to fine tune sonnet 3.5
I just came across this while browsing the Discord. If you haven't already, you could use unsloth to cut down on the training requirements (almost by half); it also has ollama and vllm exporters, so the model you train could be used directly by most people (if you open-source it) or used for Chef easily
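For anyone curious, a minimal unsloth LoRA setup looks roughly like this; the model name, dataset, and hyperparameters below are illustrative, not Qwisine's actual config:

```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Illustrative config; not the actual Qwisine hyperparameters.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-14B",
    max_seq_length=4096,
    load_in_4bit=True,  # cuts VRAM use roughly in half vs fp16
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("json", data_files="convex_qa.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # assumes records carry a formatted "text" field
    args=TrainingArguments(
        per_device_train_batch_size=2,
        num_train_epochs=3,
        output_dir="qwisine-out",
    ),
)
trainer.train()

# unsloth can export straight to GGUF for ollama / LM Studio
model.save_pretrained_gguf("qwisine-gguf", tokenizer, quantization_method="q4_k_m")
```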
Yeah, I will open source it today! And link the Hugging Face dataset, new evaluation results, and GGUF model. I used axolotl initially, then unsloth. Thank you
Plus, not sure if Convex allows Chef to use your own / local model, since it ain't open source
This model, imho, after the new evaluation is more of a sidekick during Convex development, to draft / fix / create one task at a time
I will also be releasing in the coming days a RAG-ready chunked dataset that has been enhanced, cleaned up, and structured for RAG, so using that + Qwisine will likely achieve top performance for all purposes
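For anyone unfamiliar, "RAG-ready chunked" usually means something like the sketch below: cleaned text split into overlapping, source-tagged chunks. The sizes and field names here are my assumptions, not the actual released format:

```python
# Illustrative: split cleaned doc text into overlapping chunks with metadata.
def chunk_document(doc_id: str, text: str, size: int = 800, overlap: int = 200):
    """Return overlapping chunks so no fact is cut off at a boundary."""
    chunks, step = [], size - overlap
    for i, start in enumerate(range(0, max(len(text) - overlap, 1), step)):
        chunks.append({
            "id": f"{doc_id}-{i}",
            "source": doc_id,  # lets answers point back to the doc section
            "text": text[start:start + size],
        })
    return chunks
```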
https://huggingface.co/mugivara1/Qwisine
alright bois and girls
final score
73.03%
if i had the hardware to run the 32B, i'd bet my house it would be top 1 / 2 instead of top 3
Training sets & validation set
https://huggingface.co/datasets/mugivara1/convex-reasoning-new-train
https://huggingface.co/datasets/mugivara1/convex-reasoning-validation
https://huggingface.co/datasets/mugivara1/convex-reasoning

if only i had the hardware to run the 32B (i kinda can, but it's a bit of a hassle),
i'm confident it would score first / second. 14B is the most realistic to run on consumer hardware with at least 16GB of VRAM.
Incredible work!
amazing that you can get so far with this.
I'm going to have to do a video on this I think 🙂
Hopefully I am able to run this model locally on my poor 3070!
Lol I'm not sure you can but good luck!
If I could let you borrow my 7900 XT I would; it's not great for AI but I think it would work
Actually we might be able to make that happen
@Mugi thoughts?
Well not right now I don't have Internet but you know once I get it
I can maybe put it on some hosting service for the video. But will have to take it down after that
Or I could fine-tune an 8B version instead
Would be dumber but manageable on 3070
I was thinking tailscale
If you quantize with ollama I think you can make it fit
Q4_K would bring it down to 7 GB with acceptable performance, Q3_0 to ~5.6 GB but quality would be a bit lower
Yeah but context size
Remember if he wants to try thinking / reasoning mode. One decent request could be up to ~8k tokens
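For a rough sense of why context eats VRAM on top of the weights: the KV cache grows linearly with tokens. Back-of-envelope numbers below, using layer/head counts I believe match Qwen3-14B's config (treat them as approximations):

```python
# Rough KV-cache estimate: 2 tensors (K and V) per layer, per token.
layers, kv_heads, head_dim = 40, 8, 128  # approximate Qwen3-14B config
bytes_per_value = 2                      # fp16 cache

per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(per_token / 1024, "KiB per token")               # 160.0 KiB
print(8_000 * per_token / 2**30, "GiB for 8k tokens")  # ~1.2 GiB on top of weights
```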
I think my brain was quantized to the max
That's not a good thing
That's the point
I ain't got no context left
Oh self glazing again
The opposite but we all know I'm a genius so no point in bringing it up
Well if Mikey wants, I'll help him set it up, proper parameters and all the nuances
Good boy
My Internet still gone
My phone provider banned me because I bypassed the data cap
Had to get a T-Mobile esim