

What I’ve ultimately converged to, without any rigorous testing, is:
- using Q6 if it fits in VRAM+RAM (anything higher is a waste of memory and compute for barely any gain); otherwise I either use some small quant (rarely) or skip the model altogether;
- not really using IQ quants: as far as I remember, they depend on a calibration dataset, and I don’t want the model’s behaviour to be affected by some additional dataset;
- other than the Q6 thing, in any trade-off between speed and quality I choose quality: my usage volumes are low, and I’d rather wait for a good result;
- loading as much as I can into VRAM, leaving 1-3 GB for the system and context.
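The "fits in VRAM+RAM with headroom" check above is just arithmetic, and can be sketched like this. The ~6.56 bits/weight figure for Q6_K is an approximation I'm assuming from commonly cited llama.cpp numbers, and the hardware sizes are hypothetical examples, not my setup:

```python
def quant_size_gb(n_params_billions: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of a quantized model, in GB."""
    return n_params_billions * 1e9 * bits_per_weight / 8 / 1e9

def fits(model_gb: float, vram_gb: float, ram_gb: float,
         headroom_gb: float = 3.0) -> bool:
    """True if the model fits in VRAM+RAM while leaving headroom
    for the system and the context (KV cache)."""
    return model_gb <= vram_gb + ram_gb - headroom_gb

# A 32B model at ~6.56 bpw (rough Q6_K average) is about 26 GB.
q6_32b = quant_size_gb(32, 6.56)
print(f"Q6_K 32B ~ {q6_32b:.1f} GB")
print("fits in 24 GB VRAM + 32 GB RAM:", fits(q6_32b, 24, 32))
print("fits in  8 GB VRAM + 16 GB RAM:", fits(q6_32b, 8, 16))
```

If the last check fails, that's the "either some small quant or skip the model" branch.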
I use QwQ-32B for most questions and llama-3.1-8B for agents. I’m looking for new models to replace them, though, especially the agent one.
I want to test the new GLM models, but I’d rather wait for llama.cpp to definitively fix the bugs with them first.