Qwisine14B: FINAL Evaluation of a Fine-Tuned Model on Convex
TL;DR
Qwisine is a fine-tuned Qwen3-14B model trained on Convex documentation and synthetic reasoning data. After a first round of training (~700k tokens), it performs on par with GPT-4.1 across key Convex development tasks, beating models like DeepSeek R1, Grok 3 Mini (Beta), GPT-4o, and others. More practical data and larger model evaluations are planned.
Dataset & Training
The initial dataset was built from the Convex documentation, forming question–answer pairs across types like “what,” “why,” “how to,” and “task.” Each was paired with relevant context chunks.
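As a rough illustration, one such question–answer example might be assembled like this. The field names and helper below are assumptions for the sketch, not the actual schema of the Qwisine dataset:

```python
# Hypothetical shape of one training example built from a documentation
# chunk. Field names are illustrative, not the dataset's real schema.
def make_example(question_type, question, answer, context_chunks):
    return {
        "type": question_type,      # "what" | "why" | "how to" | "task"
        "question": question,
        "context": context_chunks,  # relevant doc excerpts paired with the Q&A
        "answer": answer,
    }

example = make_example(
    "how to",
    "How do I define a query function in Convex?",
    "Use the query constructor exported from _generated/server ...",
    ["Queries are read-only functions defined with query() ..."],
)
```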
Later, synthetic reasoning data was added using Claude 3.7 (thinking) to improve depth.
This first phase used ~700k tokens.
Evaluation – Phase 1
* Model: Qwisine (Qwen3-14B fine-tune)
* Evaluation scope: Core Convex categories
* Average score: 65.47%
Next Steps
* Add ~500k tokens of real-world Convex code and tasks
* Incorporate more practical, implementation-based examples
* Fine-tune and evaluate a larger 32B model
Why Convex?
I chose Convex simply because a friend uses it and talks about it all the time (ultimate convex glazer @v), and it seemed like a good opportunity to learn something new.
PS: I’ll share links to Hugging Face and Datasette later — they’re still rough, and since I’m new to this and learning as I go, I’m a bit embarrassed to put them out just yet. But they’re coming!

49 Replies
evaluation was done using the https://github.com/get-convex/convex-evals repo. What I noticed during testing/evaluation is that this score is not totally accurate: in a lot of tests the model simply forgot small details like package.json generation, or made a small typo (which cost it a perfect score)
This is fire, now we need components evals
Ai ahhhh writing
Someone loan a GPU farm?
very cool!
also "ultimate convex glazer" is a great title 😆
i'm sure @jordan has more opinions but yes the evals repo isn't perfect. in particular i think that just because a model does best in the evals doesn't mean it's necessarily ideal for Chef, where promptability is often more important than scoring
@v just collected a new role!
I'm honored 😌
Oh wow this is awesome @Mugi !!
We also noticed that AI sometimes "forgot small details like package.json generation" when we were running the evals. I assumed this was because we were perhaps pushing the model to do a little too much in the task and it was reaching its limit of instruction following. For me this showed up when we were trying to do things like "triply nested joins" or something like that.
What sort of datasets are you looking for? We might be able to help with that? Stack posts, discord content, #ask-ai questions?
My next plan: @v, send me links to Stack posts and also some demo repos. I'm planning to create a second part of the dataset composed of real-world Convex code examples, using code from the repos to reverse-engineer the input (i.e. the possible user questions) whose answer is the code itself, to further enhance the model's usefulness beyond documentation. @mikeysee, ask-ai thread questions would be a gold mine; you can only get so far on synthetic input, so that would really help
Tbh the end goal is not a model that is perfect at answering / helping with Convex questions on its own; instead, combining the fine-tuned model, the Convex MCP, and RAG should achieve the most accurate results
I was just looking at our Kapa.ai dataset and there is no export functionality. I'm not sure how useful that dataset would be anyway, as it seems like a lot of the questions people ask on the Convex docs are from people super confused about what Chef is vs. what an AI docs bot is..
So we could limit it to the #ask-ai channel, in which case we would need a way to export the data from that channel out of Discord.
I might be able to write a script that hits the Discord API and exports the data.. I'll have a think
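A paginated export could look something like the sketch below. The `fetch_page` callable is an injected stand-in for a real request to Discord's `GET /channels/{channel.id}/messages` endpoint (which takes `limit` and `before` parameters); a real script would also need a bot token and rate-limit handling:

```python
# Rough sketch of paginating a Discord channel export. `fetch_page` stands
# in for a real call to GET /channels/{channel_id}/messages?limit=..&before=..;
# injecting it keeps the pagination logic testable offline.
def export_channel(fetch_page, page_size=100):
    messages, before = [], None
    while True:
        page = fetch_page(limit=page_size, before=before)
        if not page:
            return messages
        messages.extend(page)
        # Discord returns newest-first, so paginate backwards from the
        # oldest message id in the page we just received.
        before = page[-1]["id"]
```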

I have just been informed that we have this site: https://discord-questions.convex.dev/ feel free to scrape that if it helps 🙂
Convex Community
Join the Convex Discord! Explore Convex, the fullstack TypeScript platform for developers and startup founders.
oh!!! it's perfect, thank you
well, today I will try the second phase, which in theory should be much, much better for real-world tasks and correct syntax. Hoping to do evaluations + build an app with the help of this iteration of the model and compare it to Chef in terms of tool calling, instruction following, etc.
new size of dataset ->
old ~700k tokens, new ~3M tokens. Will run some post-processing tasks before fine-tuning. This new dataset includes more practical examples, edge cases, real-world usage, and more complex inputs with code snippets simulating real user questions & tasks.
Preliminary score, Phase 2 (~3M tokens):
70.37% - second eval round. Due to an error I had to stop training on the third epoch, and noticed overfitting, so I'll have to re-run and experiment. The only category that didn't increase is data modelling: it actually dropped from 72% in Phase 1 to 66%, while every other metric increased noticeably. Though I'm hoping for an even better result if I manage to finish all epochs. Proud of top 4; if I manage to finish all epochs I'll probably reach top 3 and take DeepSeek V3's place. Also tested it in-editor (JetBrains + LM Studio) directly with tool calling. Saw no degradation in tool calling, which is fire! ❤️🔥
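The post-processing steps mentioned above aren't specified; a typical pass over a synthetic Q&A dataset might deduplicate repeated pairs and drop degenerate ones. A minimal sketch (the thresholds and keys are arbitrary assumptions, not the actual pipeline):

```python
# Illustrative dataset post-processing pass (not the actual Qwisine
# pipeline): remove exact duplicate Q&A pairs and too-short answers.
def postprocess(examples, min_answer_chars=20):
    seen, kept = set(), []
    for ex in examples:
        key = (ex["question"].strip().lower(), ex["answer"].strip())
        if key in seen:
            continue  # drop exact duplicate question/answer pairs
        if len(ex["answer"]) < min_answer_chars:
            continue  # drop degenerate, too-short answers
        seen.add(key)
        kept.append(ex)
    return kept
```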
TLDR

I think after I hopefully finish Phase 2, Phase 3 will be the last. Hoping for second place, and then I can sunset this hobby project.
wow nice!!
That's awesome work, really good to know you can get super far with this
I can only imagine how good we could get if we were able to fine tune sonnet 3.5
I just came across this while browsing the Discord. If you haven't already, you could use Unsloth to cut down on the training requirements (almost by half); it also has Ollama and vLLM exporters, so the model you train could directly be used by most people (if you open-source it) or used for Chef easily
Yeah, I will open-source it today! And link the Hugging Face dataset, new evaluation results, and GGUF model. I used Axolotl initially, then Unsloth. Thank you
Plus, not sure if Convex allows Chef to use your own/local model, since it isn't open source
This model, imho, after the new evaluation, is more of a sidekick during Convex development to draft / fix / create one task at a time
I will also be releasing a RAG-ready chunked dataset in the coming days, one that has been enhanced, cleaned up, and structured for RAG, so using that + Qwisine will likely achieve top performance for all purposes
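"RAG-ready chunking" usually means fixed-size overlapping windows over the docs, so each chunk keeps some surrounding context. A minimal sketch; the sizes are arbitrary choices, not the released dataset's actual parameters:

```python
# Slide a fixed-size window with overlap over a document so no passage is
# cut off from all of its surrounding context. Sizes here are arbitrary.
def chunk_text(text, size=800, overlap=100):
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```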
https://huggingface.co/mugivara1/Qwisine
alright bois and girls
final score
73.03%
if I had the hardware to run 32B, I'd bet my house it would be top 1/2 instead of top 3
Training sets & validation set
https://huggingface.co/datasets/mugivara1/convex-reasoning-new-train
https://huggingface.co/datasets/mugivara1/convex-reasoning-validation
https://huggingface.co/datasets/mugivara1/convex-reasoning

if only I had the hardware to run 32B (I kinda can, but it's a bit of a hassle),
I'm confident it would score first/second. 14B is the most realistic to run on consumer hardware with at least 16 GB of VRAM.
Incredible work!
amazing that you can get so far with this.
I'm going to have to do a video on this I think 🙂
Hopefully I am able to run this model locally on my poor 3070!
Lol I'm not sure you can but good luck!
If I could let you borrow my 7900xt I would it's not great for AI but I think it would work
Actually we might be able to make that happen
@Mugi thoughts?
Well not right now I don't have Internet but you know once I get it
I can maybe put it on some hosting service for the video. But will have to take it down after that
Or I could fine-tune an 8B version instead
Would be dumber but manageable on 3070
I was thinking tailscale
If you quantize with ollama I think you can make it fit
Q4_K would bring it down to ~7 GB with acceptable performance, Q3_K to ~5.6 GB, but quality would be a bit lower
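Those figures match a simple back-of-envelope estimate: file size ≈ parameter count × bits per weight / 8. Real GGUF files run a bit larger because of metadata and mixed-precision layers, so treat this as a rough lower bound:

```python
# Back-of-envelope quantized model size: total bits / 8 bits per byte
# / 1e9 bytes per GB. Real GGUF files add some overhead on top of this.
def est_gguf_gb(params, bits_per_weight):
    return params * bits_per_weight / 8 / 1e9

print(est_gguf_gb(14e9, 4.0))  # ~4-bit quant of a 14B model: ~7 GB
print(est_gguf_gb(14e9, 3.2))  # ~3-bit quant: ~5.6 GB
```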
Yeah but context size
Remember, if he wants to try thinking / reasoning mode, one decent request could be up to ~8k tokens
I think my brain was quantized to the max
That's not a good thing
That's the point
I ain't got no context left
Oh self glazing again
The opposite but we all know I'm a genius so no point in bringing it up
Well, if Mikey wants, I'll help him set it up, proper parameters and all the nuances
Good boy
My Internet still gone
My phone provider banned me because I bypassed the data cap
Had to get a T-Mobile esim
Oh heck yea! You even have the ollama compatible format! Definitely giving this a go
i knew waffleophagus had aura
Ooo yes please. I'm a bit of a simpleton when it comes to running models locally. I can just about manage Ollama though, so if we have a quantized version I could give that a crack?
If you don't mind waiting, we've got some banger news: new coding-oriented open-source models this week 😱 might achieve some crazy results
Might do the first coding agent for Convex projects, inside the IDE, that you could plug into Roo Code, Copilot, etc.
Good things come to those that wait so I dont mind waiting at all! 🙂
https://huggingface.co/moogin/qwisine-coder-8b-experimental
did an 8B model, non-reasoning. 68.30% average score, 6th overall place on the leaderboard. Approximately 50 times smaller than Claude 3.5
82.77% -- data modelling
76.14% -- fundamentals
78.87% -- queries
75.22% -- mutations
47.81% -- actions
60.17% -- idioms
57.14% -- clients
now waiting for some new models by the end of this week!
No chain of thought eval
wowzers!! thats seriously impressive!
Wow!
Maybe we can try it in chef
After my exams in 10 days, I'm planning to make a much bigger, more serious dataset this time, aiming for the first spot. Not sure though if it should be small like 8B, or 14B. 32B seems most optimal, but that means it can't really run on consumer hardware.
Lack of good data is the problem tbh
let me know if there is something I can do to help here. I don't think we have ever explicitly tried to gather a dataset like this before
Yeah, will finish my exams June 3rd and then I'll let you know. Ty for taking an interest in helping out
seems interesting: https://platform.openai.com/docs/guides/rft-use-cases
Yes, this is actually somewhat what I was planning. I'm gonna use convex-test to do RL training: I will teach the model about Convex normally with SFT first so it learns the basics, then I'll use RL, basically a reward function that will test the Convex code written by the model for a given prompt and give it plus or minus points, which will shift the model's weights as it learns
What this will do is teach the model not just to predict the next token like normal, but to "favour" writing code that works under the convex-test environment
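The reward signal described above could be sketched like this. The `run_tests` callable is an injected stand-in for actually executing a convex-test suite against the generated code (e.g. via a vitest run in a scratch project), and the reward values are arbitrary choices for the sketch:

```python
# Hypothetical reward function for RL over generated Convex code.
# `run_tests` would, in a real setup, drop the model's code into a project
# and run its convex-test suite, returning (passed, total) test counts.
def code_reward(generated_code, run_tests):
    try:
        passed, total = run_tests(generated_code)
    except Exception:
        return -1.0                    # code didn't even build or run
    if total == 0:
        return 0.0                     # nothing to score
    return 2.0 * passed / total - 1.0  # scale pass rate to [-1, +1]
```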
Oh perfect! Ye I am very interested to hear how that turns out, if that technique works or not