
RDMA over Thunderbolt is a thing now.

I just went through an eerily similar situation where the coding agent was able to muster some pretty advanced math (information geometry) to solve my problem at hand.

But while I was able to understand it enough to steer the conversation, I was utterly unable to make any meaningful change to the code or grasp what it was doing. Unfortunately, unlike in the case you described, chatting with the LLM didn't cut it, as the domain is too challenging. I've been down a rabbit hole for days now, picking up the math foundations and rewriting the code at a slower pace, albeit one I can keep up with.

And to be honest it's incredibly fun. Applied math with a smart, dedicated tutor, plus the ability to immediately see results and build your intuition, is miles ahead of what I remember from my formative years.


Very cool insights, thanks for sharing!

Do you have benchmarks for the SGLang vs vLLM latency and throughput question? Not to challenge your point, but I’d like to reproduce these results and fiddle with the configs a bit, also on different models & hardware combos.

(happy Modal user btw)


Why not? Run it with the latest vLLM and enable 4-bit quantization with bitsandbytes (bnb), and it will quantize the original safetensors on the fly and fit in your VRAM.
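
A minimal sketch of what I mean (the model ID is a placeholder, and older vLLM releases also want --load-format bitsandbytes alongside the quantization flag, so check your version):

  # in-flight 4-bit quantization of the original safetensors via bitsandbytes
  vllm serve <org>/<model> \
  --quantization bitsandbytes \
  --max-model-len 32768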


Except this is GLM 4.7 Flash, which has 32B total params, 3B active. At 4-bit weight quantization it should fit in about 20GB of RAM with a decent context window of 40k or so, and you can save even more by quantizing the activations and KV cache to 8-bit.
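
Roughly like this, assuming the HF repo name and that your vLLM build supports bnb for this architecture (otherwise grab a pre-quantized checkpoint):

  # ~32B weights at 4-bit is roughly 16GB, leaving headroom for activations
  # and an fp8 KV cache at ~40k context on a 20-24GB card
  vllm serve zai-org/GLM-4.7-Flash \
  --quantization bitsandbytes \
  --kv-cache-dtype fp8 \
  --max-model-len 40960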

Yes, but the parent link was to the big GLM 4.7, which had a bunch of GGUFs; the new one didn't at the time of posting, nor does it now. I'm waiting on the Unsloth folks for 4.7 Flash.

Also worth noting the size of these countries: mostly on the small/tiny side, besides Norway (an oil exporter) and Ireland (a corporate tax haven).

Perhaps making good economic decisions grows exponentially in difficulty with the population size, especially for “conventional” economies that do not have another cash cow.


I can’t help but notice the absence of a “how shitty is this country for the average person” dimension.

E.g. which of these countries has any kind of youth culture to speak of?


Ireland, but the worsening economic inequality is wearing away at the edges of it from what I could tell on my last visit.

If you're the outdoorsy type, Switzerland.

The reality is that you get way more benefit from attracting a multinational company to shift its revenue to you for tax purposes when you're a (very) small state. This strategy simply isn't available when you're the US, which already hosts major bases for all these companies; it's just not that big a deal relative to its population and government budget.

Unless you're using Docker, it's going to be time-consuming if vLLM isn't already provided and built against the ROCm dependencies.

It took me quite some time to figure out the magic combination of versions and commits, and to build each dependency successfully to run on an MI325x.


Agreed, the OOB experience kind of sucks.

Here is the magic (assuming a 4x MI325x box)...

  # ROCm vLLM nightly container with host networking and the GPU devices (kfd/dri) passed through
  docker run -it --rm \
  --pull=always \
  --ipc=host \
  --network=host \
  --privileged \
  --cap-add=CAP_SYS_ADMIN \
  --device=/dev/kfd \
  --device=/dev/dri \
  --device=/dev/mem \
  --group-add render \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -v /home/hotaisle:/mnt/data \
  -v /root/.cache:/mnt/model \
  rocm/vllm-dev:nightly
  
  # inside the container: point the default HF cache at the host cache mounted at /mnt/model
  mv /root/.cache /root/.cache.foo
  ln -s /mnt/model /root/.cache
  
  # serve GLM 4.7 FP8 across the 4 GPUs with AITER kernels and MTP speculative decoding
  VLLM_ROCM_USE_AITER=1 vllm serve zai-org/GLM-4.7-FP8 \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8 \
  --quantization fp8 \
  --enable-auto-tool-choice \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --load-format fastsafetensors \
  --enable-expert-parallel \
  --allowed-local-media-path / \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --mm-encoder-tp-mode data

Speculative decoding isn’t needed at all, right? Why include the final bits about it?


GLM 4.7 supports it, and in my experience with Claude Code an 80%+ hit rate on the speculated tokens is reasonable; with num_speculative_tokens set to 1 that's roughly 1 + 0.8 ≈ 1.8 accepted tokens per forward pass of the big model, so it's a significant speedup.

https://omarkamali.com - I write here sometimes

https://wikilangs.org - A bunch of cool language playgrounds I'm working on


That's just an implementation artifact and not a fundamental fact of life.

https://docs.vllm.ai/en/latest/features/batch_invariance/
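
If I recall the linked page correctly it's opt-in via an environment variable; treat the exact name below as an assumption and check the docs for the current switch:

  # assumed env var name per the batch_invariance docs; it trades some throughput
  # for outputs that don't change with batch size or batching order
  VLLM_BATCH_INVARIANT=1 vllm serve <org>/<model>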


Do you have any evals on how good LLMs are at generating Glyphlang?

I'm curious whether you optimized for the ability to generate functioning code or just for tokenization compression rate, which LLMs you tokenized for, and what your optimization process looked like.


Sounds like most of this is simply taking shortcuts instead of properly parsing[0].

0: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...


I'd rather view it as a failure to distinguish between data and logic. The status quo is several steps short of where it needs to be before we can productively start talking about types and completeness.

Unfortunately that, er, opportunistic shortcut is an essential behavior of modern LLMs, and everybody keeps building around it hoping the root problem will be fixed by some silver bullet further down the line.

