Hacker News | leiroigh's comments

Yes. O(1) snapshots are awesome! Persistent datastructures are a monumental achievement.

But that comes at a performance price, and in the end, you only really need persistent datastructures for niche applications.

Good examples: ZFS mostly sidesteps write amplification on SSDs (copy-on-write almost never overwrites blocks in place), and snapshots are a useful feature for the end user. (But most of your datastructures live in SRAM/DRAM, which permits fast overwriting, not in flash -- so that's a niche application.)

Another good example is how julia uses a HAMT / persistent hash-map to implement scoped values. Scoped values are inheritable threadlocals (tasklocal; in julia parlance, virtual/green thread == task), and you need to take a snapshot on forking.

Somebody please implement that for inheritable threadlocals in java! (such that you can pass an O(1) snapshot instead of copying the hashmap on thread creation)

But that is also a niche application. It makes zero sense to use these awesome fancy persistent datastructures as default everywhere (looking at you, scala!).


There is nothing counter-intuitive or julia-specific about it:

The fastest way is to have your datastructure in a (virtual) register, and that works better with immutable structures (i.e. memory-to-SSA promotion has limitations). The second-fastest way is to have your datastructure allocated on the heap and mutate it. The slowest way is to have your datastructure allocated on the heap, keep it immutable, copy it all the time, and then let the old copies get garbage collected. That last, slowest way is exactly what many "functional" languages end up doing. (Exception: read-copy-update is often a very good strategy in multi-threading, and is relatively painless thanks to the GC.)

The original post was about local variables -- and const declarations for local variables are mostly syntactic sugar; the compiler puts them into SSA form anyway (exception: const in C if you take the address of the variable and let that pointer escape).

So this is mostly the same as in every language: You need to learn what patterns allow the current compiler version to put your stuff into registers, and then use these patterns. I.e. you need to read a lot of assembly / llvm-IR until you get a feeling for it, and refresh your feelings with every compiler update. Most intuitions are similar to Rust/clang C/C++ (it's llvm, duh!), so you should be right at home if you regularly read compiler output.

Julia has excellent tooling to read the generated assembly/IR; much more convenient than java (bytecode is irrelevant, you need to read assembly or learn to read graal/C2 IR; and that is extremely inconvenient).


Pumping heat from 300 K to 900 K is not a big gain over resistive heating -- the entire thing is premised on using extremely cheap intermittent electricity during the summer, and your savings are capped at roughly 30%.


>in a higher-dimensional parent universe

That's incorrect: The parent universe is not higher-dimensional, it's the same good old 3+1 as our universe.

What they propose is: take our good old GR, and start with a (large, dilute) compactly supported, spherically symmetric collapsing cloud of matter. During the collapse, you get an event horizon; afterwards, this looks like a normal black hole from the outside, and you never see the internal evolution again ("frozen star" -- it's an event horizon). Inside, you have the matter cloud, then a large shell of vacuum, then the event horizon.

Quantum mechanics suggests that degeneracy pressure gives you an equation of state that looks like "dilute = dust" at first, and at some point "oh no, incompressible".

They figure out that under various assumptions (and I think approximations), they get a solution where the inside bounces due to the degeneracy pressure. Viewed from inside, they identify that there should be an apparent cosmological constant, with the cosmological horizon somehow (?) corresponding to the BH horizon as viewed from the outside.

Throughout the article, they plug in various rough numbers, and they claim that our observed universe (with its scale, mass, age, apparent cosmological constant, etc.) is compatible with this mechanism, even hand-waving at perturbations and CMB anisotropies.

This would be super cool if it worked!

But I'm not convinced that the model truly works (internally) yet, too much hand-waving. And the matching to our real observed universe is also not yet convincing (to me). That being said, I'm out of the cosmology game for some years, and I'm a mathematician, not a physicist, so take my view with a generous helping of salt.

(I'm commenting from "reading" the arxiv preprint, though without following all computations and references.)

PS. I think that they also don't comment on stability near the bounce. But I think that regime is known to have BKL-style anisotropic instability. Now it may be that with the right parameters, the bounce occurs before these can rear their heads, and it might even be that I missed that they or one of their references argue that this is the case if you plug in numbers matched to our observed universe.

But the model would still be amazing if it all worked out, even if it was unstable.


> with the cosmological horizon somehow (?) corresponding to the BH horizon as viewed from the outside.

That’s not mentioned in the summary. After inflation the event horizon would not exist.


I have not really looked at the summary; I opted to go straight to the source.

This identification happens in equations 31-34 on page 7f subsection "Cosmic Acceleration" in https://arxiv.org/abs/2505.23877

The justification looks super sketchy and hand-wavy to me, though, which I summarized as "somehow (?)".

"After inflation the event horizon would not exist."

Apparent cosmological constant viewed from the bouncing inside induces a cosmological horizon, which they identify with the black hole horizon viewed from the outside. Super elegant idea, but I don't buy that this is supposed to be true.


Why does this black hole bounce, whilst others (from the limited info we possess) appear to be stable, regardless of the lack of a singularity?


The bounce is invisible from the outside -- an event horizon means causal decoupling. From outside, the formation of the black hole looks like the good old "frozen star" picture.

There will never be observational evidence of what happens on the other side of any event horizon; you'd have to cross over to the other side to see it for yourself (but you won't be able to report back your findings). There's a fun Greg Egan short story about that ;)


What's the story?


"The Planck Dive", freely available on Greg Egan's website: https://www.gregegan.net/PLANCK/Complete/Planck.html


The main problem with that is that it doesn't play nice with most languages. Consider

  int foo(int* ptr) {
    int x = ptr[1<<16];      // load from ptr + 256 KiB
    *ptr += 1;               // store through ptr
    return x + ptr[1<<16];   // reload -- or may the compiler reuse x?
  }

Compilers/languages/specs tend to decide that `ptr` and `ptr + (1<<16)` cannot alias, and this can be compiled into e.g.

  foo(int*):
        mov     eax, dword ptr [rdi + 262144]
        inc     dword ptr [rdi]
        add     eax, eax        ; reuses the first load instead of reloading
        ret

which gives undesired results if `ptr` and `ptr + (1<<16)` happen to be mapped to the same physical address. This is also pretty shit to debug/test -- some day, somebody will enable LTO for an easy performance win on release builds, and bad code with a security vuln gets shipped.


I don't think that's a fundamental problem. In, say, Rust (with its famously strict aliasing requirements), you obviously need some level of unsafe. You certainly want to ensure you don't hand out `&mut [T]` references that alias each other or any `&[T]` references according to either virtual or physical addresses, but that seems totally possible. I would represent the ring buffer with a raw pointer and length, then for callers construct `&[T]` and `&mut [T]` regions as needed that are never longer than the full (unmirrored) length and thus never include the same byte twice. There are several existing Rust crates for the mirrored buffer that presumably do this (though I haven't looked into their implementations recently to verify): slice-deque, vmcircbuf, magic-ring-buffer, vmap.

I do think though there are some downsides to this approach that may or may not be deal-breakers:

* Platform dependence. Each of the crates I mention has a fair bit of platform-specific `unsafe` code that only supports userspace on a few fixed OSs. They fundamentally can't work on microcontrollers with no MMU; I don't think WASM has this kind of flexibility either.

* Either setting up each buffer is a bit expensive (several system calls + faulting each page) or you have to do some free-listing on your own to mitigate that. You can't just rely on the standard memory allocator to do it for you. Coincidentally, just last week I was saying that freelisting is super easy for video frames, where you have a nice bound on the number of items and a fixed size; but if you're freelisting these at the library level or something, you might need to be more general.

* Buffer size constraints. Needs to be a multiple of the page size; some applications might want smaller buffers.

* Relatedly, extra TLB pressure, which is significant to many applications' performance. Not just because you have the same region mapped twice, but also because the buffer-size constraints mentioned above make it likely you won't use huge pages: on e.g. x86-64 you might use 4 KiB pages rather than 2 MiB (an additional factor of 512x) or 1 GiB (an additional factor of 262144x), as the memory allocator would help you do if the buffers could be stuffed into the same huge page as other allocations.


Rust doesn't help here; you necessarily must do all stores in potentially-mirrored memory as volatile (and possibly loads too), else you can have arbitrary spooky-action-at-a-distance issues, as, regardless of &[T] vs &mut [T] or whatever language-level aliasing features, if the compiler can see that two addresses are different (which they "definitely" are if the compiler, for one reason or another, knows that they're exactly 4096 bytes apart) it can arbitrarily reorder them, messing your ring buffer up. (and yes it can do so moving ops out of language-level lifetimes as the compiler knows that that should be safe)

vmcircbuf just exposes the mutable mirrored reference, resulting in [1] in release builds. Obvious issue, but, as my example never uses multiple references with overlapping lifetimes of any form, the issue would not be fixed by any form of more proper reference exposing; it's just simply the general issue of referring to the same data in multiple ways.

vmap afaict only exposes push-back and pop-front for mutation, so unfortunately I think the distance to cross to achieve spooky action in practice is too far (need to do a whole lap around the buffer to write to the same byte twice; and critical methods aren't inlined so nothing to get the optimizer to mess with), but it still should technically be UB.

slice_deque has many open issues about unsoundness. magic-ring-buffer doesn't build on modern rust.

[1]: https://dzaima.github.io/paste/#0TVDBTsQgFLz3K56XbptsWlo1MWz...


> Rust doesn't help here; you necessarily must do all stores in potentially-mirrored memory as volatile (and possibly loads too), else you can have arbitrary spooky-action-at-a-distance issues, as, regardless of &[T] vs &mut [T] or whatever language-level aliasing features, if the compiler can see that two addresses are different (which they "definitely" are if the compiler, for one reason or another, knows that they're exactly 4096 bytes apart) it can arbitrarily reorder them, messing your ring buffer up.

Hmm, as I think about it, I see your point about LLVM's optimizer potentially "knowing" memory hasn't changed that really has if it inlines enough even if it's never put into the same &mut [T] as the other side of the mirror (and two improperly aliased &mut [T] are never constructed).

But as an alternative to doing all the stores in a special way (and loads... I don't see how doing a volatile store to one side of the mirror is even sufficient to tell it the other side of the mirror has changed), it'd be far more practical if the caller could use a (not mirrored) `&mut [T]`. Couldn't you have an std::ops::IndexMut wrapper that returns a guard that has a DerefMut into `&mut [T]` and on Drop creates a barrier for these kinds of optimizations via `std::arch::asm!("")`? [1] Then LLVM has to assume all memory changed in that barrier.

Regarding the more specific crate issues: I found these crates a while ago and hadn't looked extensively in their implementation. Thanks for pointing these out; I will have to look more closely if/when I ever decide to actually use this approach. I was leaning toward no anyway because of the other factors I mentioned. As an alternative, I was thinking of having a ring buffer + a little extra bit at the end that is explicitly copied from the start as needed. The maximum length of one message I need a contiguous view of is far less than the total buffer size, so only a fraction of the buffer would need to be copied.

> vmcircbuf just exposes the mutable mirrored reference, resulting in [1] in release builds.

Yuck, noted, clearly wrong to give the whole thing as a `&mut [T]`.

> slice_deque has many open issues about unsoundness.

I see at least a couple of those, which seem to be "just" the usual unsafe-done-wrong sorts of things (double frees) rather than anything inherent to the mirrored buffer.

[1] https://stackoverflow.com/questions/72823056/how-to-build-a-...


Yeah, an asm marked as memory-clobbering is the proper thing; not the first time I've forgotten that volatile doesn't imply anything about other memory. (In fact, doing "((volatile uint8_t*)x)[0] = 0xaa;" in my godbolt link in a sibling thread still has the optimization happen.) Don't know how exactly it interacts with aliasing rules; maybe you'd have to explicitly pass the mutable reference to the asm as an input, otherwise it'd be illegal for the asm to change it and so the compiler can still assume it doesn't? Or, I guess, not having any references live during the asm call is the proper thing.

Probably indeed possible to do it with proper guards (the pre-pooping your pants issue is probably not a problem if you also have the asm guard in drop?).

> I see at least a couple of those, which seem to be "just" the usual unsafe-done-wrong sorts of things (double frees) rather than anything inherent to the mirrored buffer.

Yeah, possible. I was just saying that from the perspective of showing that all the ring-buffer crates not taking extreme care are incorrectly implemented.


> Don't know how exactly it interacts with aliasing rules; maybe you'd have to explicitly pass the mutable reference to the asm as an input, otherwise it'd be illegal for the asm to change it and so the compiler can still assume it isn't? or, I guess, not have any references live during the asm call is the proper thing.

I don't know either, but really it's the opposite half of the buffer you want to tell it may have changed, so I imagine it doesn't matter even if you still have the `&mut [T]` live.

Maybe the extra guard I described isn't necessary either; the DerefMut could directly return `&mut [T]` but set a `barrier_before_next_access` on the ring, or you could just always have the barrier, whatever performs best I guess.


>so unfortunately

I see a fellow enjoyer of bugs ;)

>vmap afaict only exposes push-back and pop-front for mutation

what about https://doc.rust-lang.org/nightly/std/io/trait.Write.html#ty... ?

>and critical methods aren't inlined

They aren't inlined explicitly -- but that does not mean that they are not inlined in practice (depending on build options). Also, LLVM can look inside a noinline method's available body for alias analysis :(

This is a big pain whenever one wants to do formally-UB shenanigans. I'm not a rustacean, but in julia a @noinline directive will simply tell LLVM not to inline, but won't hide the method body from LLVM's alias analysis. For that, one needs to do something similar to dynamic linking, with the implied performance impact (the equivalent of non-LTO static linking doesn't exist in julia).


> I see a fellow enjoyer of bugs ;)

Yep! :)

I did look at the assembly on a release build and the write method was in fact not inlined (needed to get the compiler to reason about the offset aliasing); that write method is what I called "push-back" there. I could've modified the crate to force-inline, but that's, like, effort, just to make a trivially-true assertion for one HN post.

Indeed a lack of an equivalent of gcc's __attribute__((noipa)) is rather annoying with clang (there are like at least 4 issues and 2 lengthy discussions around llvm, plus one person a week ago having asked about it in the llvm discord, but so far nothing has happened); another obvious problem being trying to do benchmarking.

(for reference, what I was trying to get to happen was an equivalent of https://godbolt.org/z/jobs6M95G)


Volatile stores would fix that issue. But it does mean that it'd be unsafe to lend out mutable references to objects. (maybe you'd need to do volatile loads too, depending on model of volatility)


That's pretty cool.

Normally it would be either the programmer's or the compiler's job to unroll a loop and then reduce dependency-chain lengths.

But it's nice if the renamer can do that as well.

Presumably Intel has real-world data suggesting that significant real workloads can profit from this.

I wonder whether that points to specific software issues, like hypothetically "oh yeah, openjdk8 hotspot was a little too timid at loop unrolling. It won't get that JIT improvement backported, but our customers will use java8 forever. Better fix that in silicon".


^this!

In garbage-collected languages, please give me gradual / optional annotations that permit deterministic fast freeing of temps, in code that opts in.

Basically to relieve GC pressure, at some modest cost of programmer productivity.

This unfortunately makes no sense for small bump-allocated objects in languages with a relocating GC, say typical java objects. But it would make a lot of sense, even in the JVM, for safe eager deterministic release of my 50 MB giant buffers.

Another gradual lifetime example is https://cuda.juliagpu.org/stable/usage/memory/ -- GPU allocations are managed and garbage collected, but you can optionally `unsafe_free!` the most important ones, in order to reduce GC pressure (at significant safety cost, though!).


This is very, very well known. Cf. https://en.wikipedia.org/wiki/Affine_group

I don't see how people should glorify this with the word "algorithm". It is a trivial undergrad homework exercise, once you give the hint "use parallel reduce / fold / prefix sum".

This may involve more interesting tradeoffs if you deal with large or sparse matrices or matrix-free operators.


If you know more than others do, that's great, but instead of posting putdowns, please share some of what you know so the rest of us can learn.

The trouble with comments like this is that they degrade discussion and put others down without really teaching us anything.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...

https://news.ycombinator.com/newsguidelines.html


This is awesome and the first precedent I've ever seen for a standard library doing the right thing on rand floats. Big kudos to the zig people and thanks for brightening my day!


Depends on control registers like e.g. MXCSR. It's an utter mess. Consider e.g. https://news.ycombinator.com/item?id=32738206

