jimbocoin 🃏 @jimbocoin - 5mo
Learning more about running my own LLMs at home. Apparently, the quantization method impacts performance differently on different kinds of hardware. This is why, if you’re browsing models on Hugging Face, you’ll see files with suffixes like “Q3_K_S” and “IQ2_XXS”. The number after the “Q” is roughly the bits per weight, and the letters around it tell you which quantization variant the model uses. Some will be much slower than others depending on the capabilities of the CPU and GPU in the machine. #llm
Ah but it does! Once you download the gguf file from Hugging Face, you can use ollama’s create command, passing in a Modelfile that specifies the path to the gguf. Then you can use ollama run to start up the model. It’s kinda annoying but there are instructions online: https://www.markhneedham.com/blog/2023/10/18/ollama-hugging-face-gguf-models/ I used this technique to run mradermacher/dolphin-2.9.2-mixtral-8x22b-GGUF: https://huggingface.co/mradermacher/dolphin-2.9.2-mixtral-8x22b-GGUF
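For anyone following along, here’s roughly what that looks like. The file and model names are just placeholders for whichever quant you actually downloaded:

# Modelfile: a one-liner pointing at the downloaded gguf
FROM ./dolphin-2.9.2-mixtral-8x22b.Q3_K_S.gguf

# register it under a local name, then run it
ollama create dolphin-local -f Modelfile
ollama run dolphin-local

One thing to watch: ollama create copies the weights into its own model store, so you’ll probably need the disk space twice until you delete the original gguf.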
No, sorry, that’s the file extension. For example, this page has some large *.gguf files split into parts (because Hugging Face has a max upload size of 50GB): https://huggingface.co/mradermacher/dolphin-2.9.2-mixtral-8x22b-GGUF Once you download the two parts, you can combine them into a single *.gguf file that ollama can import. Instructions for combining the part files can be found here: https://huggingface.co/TheBloke/KafkaLM-70B-German-V0.1-GGUF
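The combining step is simpler than it sounds: as far as I know the parts are just a byte-split of one big file, so you concatenate them in order. A rough sketch of what those instructions boil down to (filenames are placeholders for the parts you actually downloaded):

# Linux / macOS: join the parts in order
cat model.Q3_K_S.gguf.part1of2 model.Q3_K_S.gguf.part2of2 > model.Q3_K_S.gguf

# Windows (cmd)
COPY /B model.Q3_K_S.gguf.part1of2 + model.Q3_K_S.gguf.part2of2 model.Q3_K_S.gguf

Heads up: some repos split with llama.cpp’s newer gguf-split tool instead (parts named like -00001-of-00002.gguf); I believe those need llama-gguf-split --merge rather than cat.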
Yeah, I believe there are tools that can convert them but I haven’t tried. Once I found that there were already gguf files for the models I wanted to run, I just used those. If you try the conversion tools, let me know how it goes!
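If anyone does want to try it, my understanding is that the usual route is llama.cpp’s conversion script, roughly like this (paths, names, and the quant type are just examples, and I haven’t actually run it myself):

# convert a Hugging Face safetensors checkpoint to a gguf in f16
python convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf --outtype f16

# then optionally quantize it down with llama.cpp’s quantize tool
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M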