
RDMA over Thunderbolt is a thing now.

I just went through an eerily similar situation where the coding agent was able to muster some pretty advanced math (information geometry) to solve my problem at hand.

But while I was able to understand it enough to steer the conversation, I was utterly unable to make any meaningful change to the code or grasp what it was doing. Unfortunately, unlike in the case you described, chatting with the LLM didn't cut it, as the domain is too challenging. I've been down a rabbit hole for days now, picking up the math foundations and rewriting the code at a slower pace, albeit one I can keep up with.

And to be honest it's incredibly fun. Applied math with a smart, dedicated tutor, plus the ability to immediately see results and build your intuition, is miles ahead of what I remember from my formative years.


Very cool insights, thanks for sharing!

Do you have benchmarks for the SGLang vs vLLM latency and throughput question? Not to challenge your point, but I’d like to reproduce these results and fiddle with the configs a bit, also on different models & hardware combos.

(happy Modal user btw)


Why not? Run it with the latest vLLM and enable 4-bit quantization with bitsandbytes (bnb), and it will quantize the original safetensors on the fly and fit in your VRAM.
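
A minimal sketch of what I mean (the model ID is a placeholder, and older vLLM releases also want --load-format bitsandbytes alongside the quantization flag, so check your version):

  # in-flight 4-bit quantization of the original safetensors via bitsandbytes
  vllm serve <org>/<model> \
  --quantization bitsandbytes \
  --max-model-len 32768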


Except this is GLM 4.7 Flash, which has 32B total params, 3B active. At 4-bit weight quantization it should fit in about 20GB of RAM with a decent context window of 40k or so, and you can save even more by quantizing the activations and KV cache to 8-bit.
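
Roughly like this, assuming the HF repo name and that your vLLM build supports bnb for this architecture (otherwise grab a pre-quantized checkpoint):

  # ~32B weights at 4-bit is roughly 16GB, leaving headroom for activations
  # and an fp8 KV cache at ~40k context on a 20-24GB card
  vllm serve zai-org/GLM-4.7-Flash \
  --quantization bitsandbytes \
  --kv-cache-dtype fp8 \
  --max-model-len 40960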

Yes, but the parent link was to the big GLM 4.7, which had a bunch of GGUFs; the new one didn't at the time of posting, nor does it now. I'm waiting on the Unsloth folks for 4.7 Flash.

Also worth noting the size of these countries: mostly on the small/tiny side, besides Norway (an oil exporter) and Ireland (a corporate tax haven).

Perhaps making good economic decisions grows exponentially in difficulty with the population size, especially for “conventional” economies that do not have another cash cow.


I can’t help but notice the absence of a “how shitty is this country for the average person” dimension.

E.g. which of these countries has any kind of youth culture to speak of?


Ireland, but the worsening economic inequality is wearing away at the edges of it from what I could tell on my last visit.

If you're the outdoorsy type, Switzerland.

The reality is that you get way more benefit from attracting a multinational company to shift its revenue to you for tax purposes when you're a (very) small state. This strategy simply isn't available when you're the US, which already hosts major bases for all these companies; it's just not that big a deal relative to its population and government budget.

Unless you're using Docker, it's going to be time-consuming if vLLM isn't already provided and built against the ROCm dependencies.

It took me quite some time to figure out the magic combination of versions and commits, and to build each dependency successfully to run on an MI325x.


Agreed, the OOB experience kind of sucks.

Here is the magic (assuming a 4x MI325x box)...

  # ROCm vLLM nightly container with host networking and the GPU devices (kfd/dri) passed through
  docker run -it --rm \
  --pull=always \
  --ipc=host \
  --network=host \
  --privileged \
  --cap-add=CAP_SYS_ADMIN \
  --device=/dev/kfd \
  --device=/dev/dri \
  --device=/dev/mem \
  --group-add render \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -v /home/hotaisle:/mnt/data \
  -v /root/.cache:/mnt/model \
  rocm/vllm-dev:nightly
  
  # inside the container: point the default HF cache at the host cache mounted at /mnt/model
  mv /root/.cache /root/.cache.foo
  ln -s /mnt/model /root/.cache
  
  # serve GLM 4.7 FP8 across the 4 GPUs with AITER kernels and MTP speculative decoding
  VLLM_ROCM_USE_AITER=1 vllm serve zai-org/GLM-4.7-FP8 \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8 \
  --quantization fp8 \
  --enable-auto-tool-choice \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --load-format fastsafetensors \
  --enable-expert-parallel \
  --allowed-local-media-path / \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --mm-encoder-tp-mode data

Speculative decoding isn’t needed at all, right? Why include the final bits about it?


GLM 4.7 supports it, and in my experience with Claude Code an 80%+ hit rate on the speculated tokens is reasonable; with num_speculative_tokens set to 1 that's roughly 1 + 0.8 ≈ 1.8 accepted tokens per forward pass of the big model, so it's a significant speedup.

https://omarkamali.com - I write here sometimes

https://wikilangs.org - A bunch of cool language playgrounds I'm working on


That's just an implementation artifact and not a fundamental fact of life.

https://docs.vllm.ai/en/latest/features/batch_invariance/
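
If I recall the linked page correctly it's opt-in via an environment variable; treat the exact name below as an assumption and check the docs for the current switch:

  # assumed env var name per the batch_invariance docs; it trades some throughput
  # for outputs that don't change with batch size or batching order
  VLLM_BATCH_INVARIANT=1 vllm serve <org>/<model>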


Do you have any evals on how good LLMs are at generating Glyphlang?

I'm curious whether you optimized for the ability to generate functioning code or just for tokenization compression rate, which LLMs you tokenized for, and what your optimization process looked like.


Sounds like most of this is simply taking shortcuts instead of properly parsing[0].

0: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...


I'd rather view it as a failure to distinguish between data and logic. The status quo is several steps short of where it needs to be before we can productively start talking about types and completeness.

Unfortunately that, er, opportunistic shortcut is an essential behavior of modern LLMs, and everybody keeps building around it hoping the root problem will be fixed by some silver bullet further down the line.

