Run a small model locally

If you are into running LLMs locally, you have probably come across llama.cpp, basically your best friend for optimizing a model to run on your own PC. In our case we have a laptop with roughly 16 GB of RAM and an AMD Ryzen 7000 series CPU with an integrated GPU.

The GPU is built into the processor and tightly coupled with the CPU, so the two form a single unit. The CPU and the iGPU share the same system memory.

This is what we have initially:

			
free -h
               total        used        free      shared  buff/cache   available
Mem:            13Gi       4,9Gi       4,1Gi       232Mi       5,1Gi       8,5Gi
Swap:          511Mi          0B       511Mi
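The column that matters for loading a model is "available" rather than "free", since it estimates how much memory new programs can claim without swapping. As a small sketch, the same number can be read from /proc/meminfo; the helper name available_mib is made up for this example.

```shell
# Print MemAvailable in MiB, read from a meminfo-style file
# (defaults to /proc/meminfo; the optional argument is only there for testing).
available_mib() {
    awk '/^MemAvailable:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

echo "Available for a model: $(available_mib) MiB"
```

The model weights plus the context it keeps while generating have to fit into this number, otherwise the machine starts swapping.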

How does llama.cpp help us in our case?

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware – locally and in the cloud.

LLM inference – the model receives a prompt and generates tokens in response, one at a time.
AVX, AVX2, AVX512 and AMX support for x86 architectures – these are extra vector instructions on your CPU. Think of them as a way to process several numbers at once instead of one number at a time when doing computations. In our case we have AVX2.

			
lscpu
Architecture:                x86_64
  CPU op-mode(s):            32-bit, 64-bit
  Address sizes:             48 bits physical, 48 bits virtual
  Byte Order:                Little Endian
CPU(s):                      8
  On-line CPU(s) list:       0-7
Vendor ID:                   AuthenticAMD
  Model name:                AMD Ryzen 3 7320U with Radeon Graphics
    CPU family:              23
    Model:                   160
    Thread(s) per core:      2
    Core(s) per socket:      4
    Socket(s):               1
    Stepping:                0
    Frequency boost:         enabled
    CPU(s) scaling MHz:      35%
    CPU max MHz:             4151,7300
    CPU min MHz:             425,1780
    BogoMIPS:                4791,47
    Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid
                             extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs
                             skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb
                             sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists
                             pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca
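The flag list is long, so picking out only the SIMD extensions by hand is error-prone. A minimal sketch that filters them from /proc/cpuinfo could look like this; the helper name simd_flags and the choice of flags to look for (avx, avx2, avx512f, amx_tile) are assumptions made for the example.

```shell
# Print every SIMD extension from a fixed list that shows up in the first
# "flags" line of a cpuinfo-style file (defaults to /proc/cpuinfo).
simd_flags() {
    grep -m1 '^flags' "${1:-/proc/cpuinfo}" \
        | tr ' ' '\n' \
        | grep -x -E 'avx|avx2|avx512f|amx_tile' \
        | sort -u
}

simd_flags
```

For the flags shown above this should list avx and avx2 only, which matches what we read from lscpu.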

		

1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use – we are going to reduce the precision of the model's weights a little, so that the model fits in our memory and can run locally.
llama-server – A lightweight, OpenAI API compatible, HTTP server for serving LLMs.
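To get a feeling for why quantization matters on a machine like ours, here is a back-of-the-envelope estimate: weight size is roughly the number of parameters times the bits per weight, divided by 8. The 7-billion-parameter model is only an illustrative assumption, and weights_mb is a helper invented for this sketch.

```shell
# Rough weight size in MB:
# parameters (in billions) * 1000 * bits per weight / 8 bits per byte.
weights_mb() {
    params_billions=$1
    bits_per_weight=$2
    echo $(( params_billions * 1000 * bits_per_weight / 8 ))
}

# Hypothetical 7B model at three precisions.
for bits in 16 8 4; do
    echo "7B model at ${bits}-bit: ~$(weights_mb 7 "$bits") MB"
done
```

Real quantized files come out somewhat larger, because the formats also store scaling data and the context needs memory at run time, but the estimate shows the trend: a 16-bit 7B model would not fit in our available RAM, while a 4-bit one fits comfortably.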

All right, let’s get back to work.

1. Clone the repo (this might take some time).

			
git clone https://github.com/ggml-org/llama.cpp.git
Cloning into 'llama.cpp'...
remote: Enumerating objects: 85303, done.
remote: Counting objects: 100% (74/74), done.
remote: Compressing objects: 100% (48/48), done.
remote: Total 85303 (delta 41), reused 26 (delta 26), pack-reused 85229 (from 2)
Receiving objects: 100% (85303/85303), 338.96 MiB | 1.74 MiB/s, done.
Resolving deltas: 100% (61591/61591), done.
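Most of those ~339 MiB are project history. If you only want to build the current code, a shallow clone is one way to shorten the download. The wrapper name shallow_clone is made up for this sketch; --depth 1 is a standard git clone option that fetches only the latest commit.

```shell
# Clone only the latest commit of a repository instead of its full history.
# $1 = repository URL, $2 = target directory.
shallow_clone() {
    git clone --depth 1 "$1" "$2"
}

# Usage (downloads from the network):
# shallow_clone https://github.com/ggml-org/llama.cpp.git llama.cpp
```

The trade-off is that older commits are not available locally; `git fetch --unshallow` brings them in later if you need them.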

		

2. Build llama.cpp (in our case for the CPU backend; in your case it might be a different backend). Also make sure you have cmake installed.

https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md

Initially I got the error below, but it was fixed after installing build-essential.

			
cmake -B build -DCMAKE_BUILD_TYPE=Debug
-- The C compiler identification is GNU 15.2.0
-- The CXX compiler identification is unknown
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
CMake Error at CMakeLists.txt:2 (project):
  No CMAKE_CXX_COMPILER could be found.
  Tell CMake where to find the compiler by setting either the environment
  variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
  to the compiler, or to the compiler name if it is in the PATH.
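The error means CMake found a C compiler but no C++ compiler. A small pre-flight check like the one below can catch this before configuring. The helper name missing_tools is invented for this sketch, and the apt package names assume a Debian or Ubuntu system.

```shell
# Print the name of every required tool that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools cc c++ cmake)
if [ -n "$missing" ]; then
    echo "Missing build tools:" $missing
    echo "On Debian/Ubuntu: sudo apt install build-essential cmake"
else
    echo "Toolchain looks complete; configure and build with:"
    echo "  cmake -B build"
    echo "  cmake --build build --config Release"
fi
```

The two cmake commands it prints are the ones from the llama.cpp build guide linked above. A Debug build like the one configured here also works, but a Release build is generally the better choice when the goal is inference speed.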

		
