Mugi · 2w ago

Qwisine14B: FINAL Evaluation of a Fine-Tuned Model on Convex

Qwisine: Early Evaluation of a Fine-Tuned Model on Convex

TL;DR
Qwisine is a fine-tuned Qwen3-14B model trained on Convex documentation and synthetic reasoning data. After a first round of training (~700k tokens), it performs on par with GPT-4.1 across key Convex development tasks, beating models like DeepSeek R1, Grok 3 Mini (Beta), GPT-4o, etc. More practical data and larger model evaluations are planned.

Dataset & Training
The initial dataset was built from the Convex documentation, forming question–answer pairs across types like "what," "why," "how to," and "task." Each pair was matched with relevant context chunks. Synthetic reasoning data was later added using Claude 3.7 (thinking) to improve depth. This first phase used ~700k tokens.

Evaluation – Phase 1
* Model: Qwisine (Qwen3-14B fine-tune)
* Evaluation scope: Core Convex categories
* Average score: 65.47%

Next Steps
* Add ~500k tokens of real-world Convex code and tasks
* Incorporate more practical, implementation-based examples
* Fine-tune and evaluate a larger 32B model

Why Convex?
I chose Convex simply because a friend uses it and talks about it all the time (ultimate convex glazer @v), and it seemed like a good opportunity to learn something new.

PS: I'll share the Hugging Face and dataset links later — they're still rough, and since I'm new to this and learning as I go, I'm a bit embarrassed to put them out just yet. But they're coming!
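(Not part of the original post: a minimal sketch of what one entry in a docs-derived QA dataset like this could look like. The field names and values are hypothetical, not the actual Qwisine schema.)

```python
# Hypothetical shape of one docs-derived training example (field names assumed,
# not taken from the actual Qwisine dataset).
example = {
    "question_type": "how to",   # one of: "what" / "why" / "how to" / "task"
    "question": "How do I define a mutation that inserts a document in Convex?",
    "context": [                 # relevant chunks pulled from the Convex docs
        "Mutations are defined with the mutation() wrapper and can write to the database...",
    ],
    "reasoning": "The user wants to write data, so a mutation (not a query) is needed...",
    "answer": "Define it in convex/*.ts using the mutation() wrapper and ctx.db.insert(...).",
}
```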
Mugi (OP) · 2w ago
Evaluation was done using the https://github.com/get-convex/convex-evals repo. What I noticed during testing/evaluation is that the score isn't totally accurate: in a lot of tests the model simply forgot small details, like package.json generation, or made a small typo, which cost it a perfect score.
v · 2w ago
This is fire, now we need components evals
AI ahhhh writing
Someone loan a GPU farm?
james · 2w ago
very cool! also "ultimate convex glazer" is a great title 😆 i'm sure @jordan has more opinions but yes the evals repo isn't perfect. in particular i think that just because a model does best in the evals doesn't mean it's necessarily ideal for Chef, where promptability is often more important than scoring
Wayne · 2w ago
@v just collected a new role!
v · 2w ago
I'm honored 😌
mikeysee · 2w ago
Oh wow this is awesome @Mugi !! We also noticed that AI sometimes "forgot small details like package.json generation" when we were running the evals. I assumed this was because we were perhaps pushing the model to do a little too much in the task and it was reaching its limit of instruction following. For me this showed up when we were trying to do things like "triply nested joins" or something like that. What sort of datasets are you looking for? We might be able to help with that: Stack posts, Discord content, #ask-ai questions?
Mugi (OP) · 2w ago
My next plan: @v, send me links to the Stack posts and also some demo repos. I'm planning to create the second part of the dataset from real-world Convex code examples, by taking code from the repos and reversing the input, i.e. generating the possible user questions that would lead to that answer (the code from the repos). That should further enhance the model's usefulness beyond the documentation and toward real-world usage. @mikeysee the ask-ai thread questions would be a gold mine; you can only get so far on synthetic input, so that would really help. Tbh the end goal is not a model that is perfect at answering / helping with Convex questions on its own; instead, it's combining the fine-tuned model, the Convex MCP, and RAG to achieve the most accurate results.
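(Not from the thread: a minimal sketch of the "reverse the input" idea under stated assumptions, where an LLM is asked to invent the user question that a given repo snippet answers. The generate_question helper, prompt wording, and model choice are hypothetical.)

```python
# Hypothetical sketch: turn real repo code into (question, answer) training pairs
# by asking an LLM to invent the user question the snippet answers.
from openai import OpenAI  # any chat-completion client would work here

client = OpenAI()

def generate_question(code_snippet: str) -> str:
    # Prompt wording is illustrative, not the dataset's actual prompt.
    resp = client.chat.completions.create(
        model="gpt-4.1",  # placeholder model choice
        messages=[
            {"role": "system", "content": "You write realistic developer questions."},
            {"role": "user", "content": "Write the question a Convex developer would ask "
                                         f"that this code answers:\n\n{code_snippet}"},
        ],
    )
    return resp.choices[0].message.content

def make_pair(code_snippet: str) -> dict:
    # One synthetic training pair: generated question, real code as the answer.
    return {"question": generate_question(code_snippet), "answer": code_snippet}
```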
mikeysee · 2w ago
I was just looking at our Kapa.ai dataset and there is no export functionality. I'm not sure how useful that dataset would be anyway, as it seems like a lot of the questions from people on the Convex docs are actually people super confused about what Chef is and what the AI Docs Bot is. So we could limit it to the #ask-ai channel, in which case we would need a way to export the data from that channel via Discord. I might be able to write a script that hits the Discord API and exports the data. I'll have a think.
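(Not from the thread: a rough sketch of what such an export script could look like, assuming a bot token with read access to the channel. The GET /channels/{id}/messages endpoint is a real Discord REST route, but the token, channel ID, and output format below are placeholders.)

```python
# Rough sketch: page backwards through a channel's history via the Discord REST API
# and dump messages to JSONL. Assumes a bot token with permission to read the channel.
import json
import requests

TOKEN = "YOUR_BOT_TOKEN"           # placeholder
CHANNEL_ID = "123456789012345678"  # placeholder channel ID
HEADERS = {"Authorization": f"Bot {TOKEN}"}

def export_channel(path: str) -> None:
    before = None
    with open(path, "w") as out:
        while True:
            params = {"limit": 100}
            if before:
                params["before"] = before
            resp = requests.get(
                f"https://discord.com/api/v10/channels/{CHANNEL_ID}/messages",
                headers=HEADERS, params=params, timeout=30,
            )
            resp.raise_for_status()
            batch = resp.json()
            if not batch:
                break  # reached the start of the channel
            for msg in batch:
                out.write(json.dumps({"author": msg["author"]["username"],
                                      "content": msg["content"]}) + "\n")
            before = batch[-1]["id"]  # oldest message in this page

export_channel("ask_ai_export.jsonl")
```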
mikeysee · 2w ago
I have just been informed that we have this site: https://discord-questions.convex.dev/ feel free to scrape that if it helps 🙂
Mugi (OP) · 2w ago
Oh!!! It's perfect, thank you. Well, today I will try the second phase, which in theory should be much, much better for real-world tasks and correct syntax. Hoping to do evaluations + build an app with the help of this iteration of the model and compare it to Chef in terms of tool calling, instruction following, etc.

New size of dataset: old ~700k tokens -> new ~3mil tokens

I will run some post-processing tasks before fine-tuning (a rough sketch follows below). This new dataset includes more practical examples, edge cases, real-world usages, and more complex inputs with code snippets simulating real user questions & tasks.

Preliminary score, Phase 2 (~3mil tokens): 70.37% (second try of the eval round, due to an error).

I had to stop training on the third epoch after noticing overfitting, so I'll have to re-run and experiment. The only category that didn't increase is data modelling: it actually dropped from 72% in phase 1 to 66%; every other metric increased noticeably. Though I'm hoping for an even better result if I manage to finish all epochs. Proud of top 4; if I manage to finish all epochs it will probably reach top 3 and take DeepSeek V3's place. Also tested it in-editor (JetBrains + LM Studio) directly with tool calling, and saw no degradation in tool calling, which is fire! ❤️‍🔥
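(Not from the thread: a minimal sketch of the kind of post-processing pass mentioned above. The exact steps are assumptions: dedup, length filtering, and dropping malformed records from a JSONL dataset.)

```python
# Minimal, assumed post-processing pass over a JSONL dataset before fine-tuning:
# drop exact duplicates, trivially short answers, and malformed records.
import hashlib
import json

def postprocess(in_path: str, out_path: str, min_answer_chars: int = 40) -> None:
    seen = set()
    kept = 0
    with open(in_path) as f_in, open(out_path, "w") as f_out:
        for line in f_in:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # drop malformed rows
            if not rec.get("question") or len(rec.get("answer", "")) < min_answer_chars:
                continue  # drop empty questions / trivially short answers
            digest = hashlib.sha256((rec["question"] + rec["answer"]).encode()).hexdigest()
            if digest in seen:
                continue  # drop exact duplicates
            seen.add(digest)
            f_out.write(json.dumps(rec) + "\n")
            kept += 1
    print(f"kept {kept} records")

postprocess("convex_phase2_raw.jsonl", "convex_phase2_clean.jsonl")
```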
Mugi (OP) · 2w ago
TLDR (image attachment)
Mugi (OP) · 2w ago
I think after I hopefully finish phase 2, phase 3 will be the last. Hoping for second place, and then I can sunset this hobby project.
mikeysee · 7d ago
Wow nice!! That's awesome work, really good to know you can get super far with this. I can only imagine how good we could get if we were able to fine-tune Sonnet 3.5.
radix · 5d ago
I just came across this while browsing the Discord. If you haven't already, you could use Unsloth to cut down on the training requirements (almost by half). It also has Ollama and vLLM exporters, so the model you train could be used directly by most people (if you open-source it) or used for Chef easily.
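(Not from the thread: a minimal sketch of an Unsloth LoRA setup along those lines. The base checkpoint name, sequence length, and LoRA hyperparameters are placeholder assumptions, not Qwisine's actual training config.)

```python
# Minimal, assumed Unsloth QLoRA setup (placeholder hyperparameters, not Qwisine's config).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-14B",  # assumed base checkpoint name
    max_seq_length=4096,
    load_in_4bit=True,               # 4-bit loading to cut VRAM use
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# ...train with your usual trainer (e.g. TRL's SFTTrainer), then export to GGUF:
# model.save_pretrained_gguf("qwisine-gguf", tokenizer, quantization_method="q4_k_m")
```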
Mugi (OP) · 3d ago
Yeah, I will open-source it today! And link the Hugging Face dataset, new evaluation results, and GGUF model. I used Axolotl initially, then Unsloth. Thank you!
Plus, I'm not sure if Convex allows Chef to use your own / local model, since it ain't open source.
This model, imho, after the new evaluation, is more of a sidekick during Convex development: to draft / fix / create toward one task at a time.
I will also be releasing in the coming days a RAG-ready chunked dataset that has been enhanced, cleaned up, and structured for RAG, so using that + Qwisine will likely achieve top performance for all purposes.
https://huggingface.co/mugivara1/Qwisine
alright bois and girls
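(Not from the thread: a minimal sketch of the "chunked dataset + Qwisine" RAG idea under stated assumptions: embed the chunks, retrieve the closest ones for a question, and prepend them to the prompt given to the fine-tuned model. The embedding model choice and prompt format are placeholders.)

```python
# Minimal, assumed RAG sketch: retrieve the closest chunks for a question and
# prepend them to the prompt sent to the fine-tuned model.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

chunks = [
    "Mutations are defined with the mutation() wrapper...",
    "Queries are reactive and cached...",
]  # stands in for the RAG-ready chunked Convex dataset
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def build_prompt(question: str, top_k: int = 3) -> str:
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec                   # cosine similarity (vectors normalized)
    best = np.argsort(scores)[::-1][:top_k]
    context = "\n\n".join(chunks[i] for i in best)
    return f"Use the following Convex docs context:\n{context}\n\nQuestion: {question}"

# The resulting prompt would then be sent to the Qwisine model (e.g. via LM Studio).
```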
Mugi (OP) · 3d ago
Final score: 73.03%. If I had hardware to run 32B, I'd bet my house it would be top 1/2 instead of top 3.
Training sets & validation set:
https://huggingface.co/datasets/mugivara1/convex-reasoning-new-train
https://huggingface.co/datasets/mugivara1/convex-reasoning-validation
https://huggingface.co/datasets/mugivara1/convex-reasoning
Mugi (OP) · 3d ago
If only I had the hardware to run 32B (I kinda can, but it's a bit of a hassle), I'm confident it would score first / second. 14B is the most realistic to run on consumer hardware with at least 16GB VRAM.
mikeysee · 19h ago
Incredible work! Amazing that you can get so far with this. I'm going to have to do a video on this I think 🙂 Hopefully I'm able to run this model locally on my poor 3070!
v · 15h ago
Lol, I'm not sure you can, but good luck! If I could let you borrow my 7900 XT I would; it's not great for AI but I think it would work. Actually, we might be able to make that happen, @Mugi thoughts? Well, not right now since I don't have Internet, but you know, once I get it.
Mugi (OP) · 15h ago
I can maybe put it on some hosting service for the video, but I will have to take it down after that. Or I could fine-tune an 8B version instead; it would be dumber but manageable on a 3070.
v · 14h ago
I was thinking tailscale
radix · 12h ago
If you quantize with Ollama I think you can make it fit. Q4_K would bring it down to ~7 GB with acceptable performance, Q3_0 to ~5.6 GB, but quality would be a bit lower.
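(Not from the thread: a quick back-of-the-envelope helper behind size estimates like these, assuming size ≈ parameter count × bits per weight. It ignores GGUF metadata overhead and the mixed bit widths inside K-quants, so real files differ somewhat.)

```python
# Rough GGUF size estimate: parameters * bits-per-weight / 8, ignoring metadata
# overhead and the mixed bit widths used inside K-quants.
def approx_gguf_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(approx_gguf_gb(14, 4.5))  # ~7.9 GB for a 14B model at ~4.5 bits (Q4_K-ish)
print(approx_gguf_gb(14, 3.4))  # ~6.0 GB at ~3.4 bits (Q3-ish)
```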
Mugi (OP) · 12h ago
Yeah, but remember context size if he wants to try thinking / reasoning mode. One decent request could be up to ~8k tokens.
v · 12h ago
I think my brain was quantized to the max
Mugi (OP) · 12h ago
That's not a good thing
v · 12h ago
That's the point I ain't got no context left
Mugi (OP) · 12h ago
Oh self glazing again
v · 12h ago
The opposite but we all know I'm a genius so no point in bringing it up
Mugi (OP) · 12h ago
Well, if Mikey wants, I'll help him set it up: proper parameters and all the nuances.
v · 11h ago
Good boy
My Internet is still gone
My phone provider banned me because I bypassed the data cap
Had to get a T-Mobile eSIM
