It seems most open-source models rely, in one way or another, on Meta's LLaMA models. I requested access to the models and got a link to use with their download.sh script. To do this, you have to clone https://github.com/facebookresearch/llama. Then you need to run the downloader, but…
I started with https://github.com/ggerganov/llama.cpp because it made all the news a while ago. llama.cpp allows running inference on CPU and supports LLaMA and its derivatives (e.g., Vicuna). That said, the UI is not great, and I couldn't find an API option either. I was also not able to run Vicuna in llama.cpp, despite what the author claimed. To use Vicuna and other LLaMA-based models, you need the right quantized versions of those models. I found some on Hugging Face, but they were all outdated because llama.cpp's developer had switched to a different quantization format. I realized that using llama.cpp directly would not be a reliable solution in the mid to long term. Moving on…
Since I wanted API access, I found https://github.com/abetlen/llama-cpp-python, which provides Python bindings for llama.cpp. It worked fine and I was able to use it in my Python programs, but it was much, much slower than llama.cpp itself. People have complained about this in the Issues, and someone mentioned that it could be due to the Python architecture on M1 Macs; I'll explain that below.
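For reference, calling the bindings from Python looks roughly like this (a minimal sketch; the model path and the prompt are placeholders I made up, and the file has to be a model quantized in the format llama.cpp expects):

from llama_cpp import Llama

# Placeholder path: point this at whatever quantized model file you have locally.
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")

# A plain completion call; the result is an OpenAI-style dict.
output = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["Q:"])
print(output["choices"][0]["text"])

Now, about the architecture issue.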
Typically, when you create a Conda environment, the Python architecture will be x86_64. You can check this:
🚀 python -c 'import platform; print(platform.platform())'
macOS-13.3.1-x86_64-i386-64bit
On M1 Macs, though, you need the ARM architecture, or else you'll run into speed issues. According to this SO answer, here's how you can fix it:
CONDA_SUBDIR=osx-arm64 conda create -n my_env python → makes an ARM environment
CONDA_SUBDIR=osx-64 conda create -n my_env python → makes an x64 environment
Now we get:
🚀 python -c 'import platform; print(platform.platform())'
macOS-13.3.1-arm64-arm-64bit
For more information, please see this and this.
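If you want to catch this from inside a script instead of eyeballing the shell output, a small guard like this works (just a sketch; the error message is my own wording):

import platform

# A native Python build on Apple Silicon reports 'arm64';
# under Rosetta (or an x86_64 Conda env) it reports 'x86_64'.
if platform.machine() != "arm64":
    raise RuntimeError(f"Non-ARM Python detected ({platform.machine()}); expect slow inference on an M1 Mac.")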
But even after taking care of my Python architecture, I ended up not using llama-cpp-python because of langchain…