During the last few weeks I have been trying to immerse myself in the local AI inference camp. It’s time to reflect on some of that journey. No question, this is going to be a lifelong journey given its nature and the way it evolves, but for now let me share the first step.
The first step is that if you are going to practice local AI inference, you are going to have to spend some money. How much depends entirely on you and your budget. For me every $ is worth it, as I see it as an investment. The picture in my head for the next 10 years is heavy on local AI inference and on building harness workflows around the layers that we used to code manually: Terraform, Python, Kubernetes …
Recommendations, Expectations and Tips.
I would not recommend buying anything below 8 GB of VRAM. This is the absolute minimum for getting good results while still enjoying the process. In fact, if you have the $, go and buy 16 GB of VRAM. The more you buy, the better off you are going to be.
I would not recommend using Windows and WSL. My first attempts were in that camp, but once I installed Kubuntu the results were better. Use Linux.
I would not recommend using a wrapper program like LM Studio. You will understand the local inference variables much better by using llama.cpp, and this will pay off later on. It’s like knowing that Kubernetes is just local Linux utilities working together to create an abstraction controlled by a single Go binary vs. viewing it as just Kubernetes.
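To make "local inference variables" a bit more concrete, here is a minimal sketch of the kind of knobs you end up setting yourself when you run llama.cpp directly. It only builds and prints a `llama-server` command line, it does not start anything. The model path, the default values and the helper name `llama_server_command` are assumptions for illustration, not a recommended setup.

```python
import shlex


def llama_server_command(model_path, ctx_size=32768, gpu_layers=99, port=8080):
    """Build an argument list for llama.cpp's llama-server (illustrative defaults)."""
    return [
        "llama-server",
        "--model", model_path,               # GGUF file to load
        "--ctx-size", str(ctx_size),         # context window in tokens
        "--n-gpu-layers", str(gpu_layers),   # layers to offload to the GPU; a large value offloads all
        "--port", str(port),                 # HTTP port the server listens on
    ]


# Hypothetical model path; replace with a GGUF file you actually downloaded.
cmd = llama_server_command("models/example-Q4_K_M.gguf")
print(shlex.join(cmd))
```

Context size and the number of GPU layers are the two values that interact most directly with how much VRAM you have.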
Do not expect miracles in terms of context size. Even if you manage to push the boundaries, reaching a 64k context is pretty solid work; 32k is fine as well.
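One reason context is expensive is the KV cache, which grows linearly with the number of tokens. A rough sketch of the usual estimate follows, assuming a hypothetical model with 32 layers, 8 KV heads, a head dimension of 128 and a 16-bit cache. These dimensions are made up for illustration, so check your own model's metadata for the real ones.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_value=2):
    """Approximate KV cache size: one K and one V vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value


GIB = 1024 ** 3

# Hypothetical model dimensions, not taken from any specific model.
for n_ctx in (8192, 32768, 65536):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=n_ctx)
    print(f"{n_ctx:>6} tokens -> {size / GIB:.1f} GiB")
```

Under these assumptions the cache alone takes about 4 GiB at 32k and 8 GiB at 64k, on top of the model weights.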
Do not expect miracles with t/s (tokens per second). This is more or less like the previous statement. Anything above 40 t/s is a solid result for me. As the engine produces tokens your context grows, and this shows in the t/s metric: it goes down.
Do not rush to build a harness around your local AI inference model. The harness is a dedicated learning sub-journey, and we are going to cover it in the next blog post.
View each model as a matrix. Dense models activate all the matrix elements, Mixture of Experts models just some of them. For now, specifically for programming tasks, dense models are more reliable.
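The "just some of those" part can be sketched as top-k routing: a router scores every expert for a token and only the best few actually run. This is a simplified illustration with made-up scores and a hypothetical helper `top_k_experts`, not how any particular model implements it.

```python
def top_k_experts(router_scores, k):
    """Return the indices of the k highest-scoring experts, best first."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i],
                    reverse=True)
    return ranked[:k]


# Hypothetical router scores for one token across 8 experts.
scores = [0.1, 2.3, -0.5, 1.7, 0.0, 0.9, -1.2, 0.4]
active = top_k_experts(scores, k=2)

print(active)  # [1, 3]
print(f"{len(active)}/{len(scores)} experts run for this token")
```

A dense model corresponds to running every expert for every token, which is the same as setting `k` to the number of experts.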
Each model comes with weights/parameters learned during pretraining. Go for Q4_K_M quantization. By default those weights are expressed in 16 bits; for local AI inference 4 bits will give you a little less precision, but the results are still good enough.
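The 16-bit vs 4-bit difference is easiest to see as plain arithmetic: parameters times bits per weight, divided by 8. Below is a back-of-the-envelope sketch for a hypothetical 8-billion-parameter model. Real Q4_K_M files come out somewhat larger than the plain 4-bit figure, because the format stores scaling data and keeps some tensors at higher precision.

```python
def weight_bytes(n_params, bits_per_weight):
    """Idealized size of the weights alone, ignoring file metadata and quantization overhead."""
    return n_params * bits_per_weight / 8


GB = 1e9
n_params = 8_000_000_000  # hypothetical 8B-parameter model

for bits in (16, 8, 4):
    print(f"{bits:>2} bits -> {weight_bytes(n_params, bits) / GB:.1f} GB")
```

With these numbers the 16-bit weights do not fit on a 16 GB card at all once you add the context, while the 4-bit version leaves room to spare.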
When it comes to downloading models, use https://github.com/bodaay/HuggingFaceModelDownloader. I tried some native Hugging Face ways to get faster download speeds, but the results were mixed.
If you have free time :) try to understand the self attention stage a little bit better. A lot of hardware limitations are hidden there.
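As a starting point, here is a minimal single-head sketch of scaled dot-product self-attention in plain Python. It leaves out the causal mask, the multiple heads and the learned projections, so treat it as a toy for reading, not as what llama.cpp actually executes. The input vectors are made up.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(q, k, v):
    """Return (weights, output) for softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    # One score per (query token, key token) pair: an n x n matrix.
    scores = [[sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d)
               for k_row in k] for q_row in q]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted mix of all value rows.
    output = [[sum(w * v_row[j] for w, v_row in zip(w_row, v))
               for j in range(len(v[0]))] for w_row in weights]
    return weights, output


# Three made-up token vectors of dimension 2, used as Q, K and V at once.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, output = self_attention(x, x, x)
print(len(weights), "x", len(weights[0]), "attention weights")
```

The thing to notice is the score matrix: every token is compared with every other token, so the work and the memory touched grow quickly as the context gets longer.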
That’s it, end of part 1.


