This is a bit off-topic and probably a noob question. But maybe someone with exp...

loudmax · on Dec 18, 2023

TheBloke's GGUF model card tells you how much RAM you'll want to run the various versions of the model: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF#provi...

In the article, Simon mentions the Q6_K.gguf model, which is about 40GB. A Mac Studio can handle this, but any of these models are going to be a tight fit or impossible on a Mac laptop without swapping to disk. Maybe NVME is fast enough that swapping isn't too terrible.

In my experience, the Mixtral models work pretty well on llama.cpp on my Linux workstation with a 10GB GPU, and offloading the rest to CPU.

It is impressive how fast the smaller models are improving. Still, a safe rule of thumb is the more RAM the better.

Also, really question how much you need to run these models locally. If you just want to play around with these models, it's probably far more cost effective to rent something in the cloud.

williamstein · on Dec 18, 2023

I tried Mixtral via ollama on my Apple M1 Max with 32GB of RAM, and was a total nonstarter. I ended up having to powercycle my machine. I then just used two L4 GPU's on Google Cloud (so 48GB of GPU RAM, see [1]) and it was very smooth and fast there.

[1] https://github.com/sagemathinc/cocalc-howto/blob/main/ollama...

jmorgan · on Dec 18, 2023

Wow, as an author of the project I'm so sorry about you having to restart your computer. The memory management in Ollama needs a lot of improvement – will be working on this a bunch going forward. I also have a M1 32GB Mac and it's unfortunately just below the amount of memory Mixtral needs to run well (for now!)

ls612 · on Dec 19, 2023

Will the Mixtral 8x7 run well on a 64GB M2 Max then?

simonw · on Dec 19, 2023

I'm using an M2 64GB and Mixtral works pretty well.

ls612 · on Dec 20, 2023

It’s wild my laptop can run AI models better than my RTX 4090 desktop. Thanks for the info!

azeirah · on Dec 18, 2023

If you want to run decently heavy models, I'd recommend getting at a minimum getting 48GB. This allows you to run 34b llama models with ease, 70b models quantized, mixtral without problems.

If you want to run most models, get 64GB. This just gives you some more room to work with.

If you want to run anything, get 128GB or more. Unquantized 70b? Check. Goliath 120b? Check.

Note that high end consumer gpus end at 24GB VRAM. I have one 7900xtx for running llms, and the best it can reliably run is 4-bit quantized 34b models, anything larger is partially in regular ram.

MilaM · on Dec 18, 2023

Thank you for this detailed response. I'm not sure if it was clear, but I was going to use just the Apple Silicon CPU/GPU, not an external one from Nvidia.

Is there anything useful you can do with 24 or 32GB of RAM with llms? Regular M2 Mac minis can only be ordered with up to 24GB of RAM. The Pro Mac mini M2 is upgradable to 32GB RAM.

robterrell · on Dec 18, 2023

I've been unable to get Mixtral (mixtral-8x7b-instruct-v0.1.Q6_K.gguf) to run well on my M2 MacBook Air (24 GB). It's super slow and eventually freezes after about 12-15 tokens of a response. You should look at M3 options with more RAM -- 64 GB or even the weird-sounding 96 GB might be a good choice.

itissid · on Dec 18, 2023

https://www.reddit.com/r/LocalLLaMA/comments/17kcgjv/how_doe... reddit thread talks about some of the pros and cons of a m3 max with 128 Gb costing ~5-6K

mkesper · on Dec 19, 2023

You can buy multiple 4090s for that money and will get real GPUs including tensor cores. Still relevant it seems: https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...

simonw · on Dec 18, 2023

32GB will run a good quantized Mixtral, though I can't confidently explain how much of a quality difference there is from unquantized.

cschneid · on Dec 18, 2023

Data Point: I am currently having issues getting Mixtral Q4_K_M running in LMStudio on my 32gb M1 Max. I'm trying Q3 to see if it fits.

I can have it run in on 'cpu' which is very slow, but offloading to the GPU runs out of memory.

geococcyxc · on Dec 19, 2023

You just have to allow more than 75% memory to be allocated to the GPU by running sudo sysctl -w iogpu.wired_limit_mb=30720 (for a 30 GB limit in this case).

cschneid · on Dec 19, 2023

1. That worked after some tweaking. 2. I had to lower the context window size to get LM Studio to load it up. 3. LM Studio has two distinct checkboxes that both say "Apple Metal GPU". No idea if they do the same thing....

Thanks a ton! I'm running on GPU w/ Mixtral 8x Instruct Q4_K_M now. tok/sec is about 4x what CPU only was. (Now at 26 tok/sec or so).

azeirah · on Dec 18, 2023

I was talking about m2 Macs. Just comparing that the best you can do with a gpu is 24GB, Macs go far beyond because of their integrated memory.

flux293m · on Dec 19, 2023

Just be aware you don't get to use all of it. I believe you only get access to ~20.8GB of GPU memory on a 32GB Apple Silicon Mac, and perhaps something like ~48GB on a 64GB Mac. I think there are some options to reconfigure the balance but they are technical and do not come without risk.

MilaM · on Dec 19, 2023

This is an important consideration. Thanks for mentioning it.

MilaM · on Dec 18, 2023

Ok, sorry. I did not understand that you just mentioned that to give more context. Totally makes sense.

itissid · on Dec 18, 2023

https://www.reddit.com/r/LocalLLaMA/comments/17kcgjv/how_doe... reddit thread has some ideas.

What I can draw from reading of that thread is that you can buy a Desktop Rig with 200GB memory bandwidth (comparable to m3 pro and max) and a lot of expansion capability (256GB RAM). You should find out if that's still good enough for your local use case for token per second or training.

Then just use SSH/XTerm(and possibly ngrok) to login with good speed from anywhere into your rig with a light M2 ?

simonw · on Dec 18, 2023

Get as much RAM as possible.

16GB is not enough.

32GB is enough to run quantized Mixtral, which is the current best openly licensed model.

... but who knows what will emerge in the next 12 months?

I have 64GB and I'm regretting not shelling out for more.

Frustratingly you still have WAY more options for running interesting models on a Linux or Windows NVIDIA device, but I like Mac for a bunch of other reasons.

Casteil · on Dec 18, 2023

I run 7b/13b models pretty gracefully on a 16GB M1 Pro, but it does leave me wanting a little more headroom for other things like Docker and browser eating multiple gigabytes of ram themselves.

Maybe keep an eye out for M1 / M2 deals with high ram config? I've seen 64GB MBPs lately for <$2300 (slickdeals.net)

MilaM · on Dec 18, 2023

Thank you (and all the adjacent comments). I don't really need so much RAM for my regular work. Running the llms locally would only be for fun and experiments.

I think 32GB might be the best middle ground for my needs and budget constraints.

It's really a pity that you can't extend RAM in most Apple Silicon Macs and have to decide carefully upfront.

amrrs · on Dec 18, 2023

I ended up ordering a 36GB M3 for similar reason.

I currently run Mistral and a few mistral derivatives using Ollama with decent inference speed on a 2019 Intel Mac 32GB. So I assumed the new one with 32ish should do a better job.

I've tried vision model Llava as well, a bit more latency but works fine.

With Apple's own Mlx things might improve .

eurekin · on Dec 18, 2023

Why not both? :)

Bait aside, I'd love to read about how are you using those models. I'm mostly interested in code comprehension and meeting summarisation.

simonw · on Dec 18, 2023

I still mostly use GPT-4 to get actual work done, because until a Mixtral came along it felt like the local models were no competition.

I'm going to bump up my usage of Mixtral a bit now to see how it feels for that kind of stuff.

eurekin · on Dec 18, 2023

Ah, same here. Have been itching to make some use of those local llms for some time, always ending up in chatgpt as well.

Although for those napkin like ideas gpt4 (including the turbo variant) get costly quickly.

nextworddev · on Dec 18, 2023

Just got 64gb as well but about to refund it for a bigger one after reading your comment :)

avereveard · on Dec 18, 2023

long story short, for a machine that has about 5+ years of shelf life it isn't a bad value prop to go and buy the absolute top end, especially if you expect some roi out of it, just be aware that if you don't want the top end for llm the m2 currently provides more value because m2 has memory bandwith than the m3 at the middle range.

memory bandwidth is the key to model speed, and memory size is what enable you to use larger model (quantization let you push thing further, to a point) so one thing to note is that on the M3 pro/max only the top end model gets the full bandwidh, while the m1/m2 pro enjoy full bandwidth from a smaller memory size. this may be important if you value speed above model size or vice versa. M2 Pro, M2 Max get approximately 200 GB/s and 400 GB/s, but things are more complicated for m3: M3 Pro gets 150mb/s, and M3 max gets 300mb/s at 36gb and 400mb/s at 48gb

few more things to note:

it's absolutely fine to go and play around with llm but even with a llm monster machine there's nothing wrong in starting with smaller models and learning how to squeeze the maximum amount of work out of them. the learning do transfer to larger model. this may or may not be important if at some point you'll want to monetize or deploy to production what you learned. while the mac itself is a good investment for personal use, once you move to servers, cost skyrockets with model size, because of supply constraints on 40gb+ memory gpus. if you are dependent to a 70b parameter model, you'll have a hard time to make a cost effective solution. if it's stricly to playing around, you can disregard this concern

even if you're playing around, a 70b is going to run at 7 tokens / second, which is fine for a local chat, but if you are writing a program and need inference in bulk, it's fairly slow.

another thing of note is that while the field is still undecided on which size and architecture is good enough, the moltitude of small fish experimenting with tuning and mix of instructions are largely experimenting on smaller models. currently my favorite is openhermes-2.5-mistral-7b-16k, but it's not an indication that mistrals are strictly better than llama2, more an indication that experimenting with 7b is more cost effective for third parties without access to gpu than experimenting with 13b, and so you'll find 13b model kinda stagnating, with many of them trained in a period where people didn't really know the best parameters for finetuning and are so to say a bit behind the curve. a few tuners are working at 70b models, but these seems to be pivoting to mixtrals and the likes, which will cause a similar stagnation on the top end, that is, until llama3 or the next mixtral size drops, then, who knows

ErneX · on Dec 19, 2023

All that you can afford, I got an M2 with 96GB RAM.

whiplash451 · on Dec 18, 2023

Rule of thumb: when buying a new Mac, always max RAM.

kristianp · on Dec 19, 2023

To max RAM, you have to also max the cpu. It gets super expensive .