
Yeah, that was just a design choice I made: I wanted a library that worked with `Iterator`s; it felt more lightweight to me and fit my immediate needs better. I'm personally not a huge fan of Pandas DataFrames for certain applications.

LOTUS (by Liana Patel et al., folks from Stanford and Berkeley; https://arxiv.org/abs/2407.11418) extends Pandas DataFrames with semantic operators; you could check out their open-source library: https://github.com/lotus-data/lotus

Semlib does batch requests; that was one of the primary motivations (I wanted to solve some concrete data processing tasks, started using the OpenAI API directly, then started calling LLMs in a for loop, then wanted concurrency...). Semlib lets you set `max_concurrency` when you construct a session, and then many of the algorithms like `map` and `sort` take advantage of I/O concurrency (e.g., see the heart of the implementation of Quicksort with I/O concurrency: https://github.com/anishathalye/semlib/blob/5fa5c4534b91aa0e...). I wrote a bit more about the origins of this library on my blog, if you are interested: https://anishathalye.com/semlib/
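
If it helps, here's the shape of the idea as a minimal sketch (not Semlib's actual code; the length comparison is just a stand-in for the pairwise LLM call so the snippet runs on its own):

    import asyncio

    sem = asyncio.Semaphore(10)  # plays the role of max_concurrency

    async def llm_less_than(a: str, b: str) -> bool:
        # stand-in for an LLM call that answers "should a come before b?"
        async with sem:
            await asyncio.sleep(0)  # pretend network I/O
            return len(a) < len(b)

    async def semantic_quicksort(items: list[str]) -> list[str]:
        if len(items) <= 1:
            return list(items)
        pivot, rest = items[0], items[1:]
        # all pairwise comparisons against the pivot are issued concurrently
        results = await asyncio.gather(*(llm_less_than(x, pivot) for x in rest))
        less = [x for x, r in zip(rest, results) if r]
        more = [x for x, r in zip(rest, results) if not r]
        left, right = await asyncio.gather(
            semantic_quicksort(less), semantic_quicksort(more)
        )
        return left + [pivot] + right

    print(asyncio.run(semantic_quicksort(["bb", "a", "dddd", "ccc"])))
    # ['a', 'bb', 'ccc', 'dddd']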

ETA: I interpreted “batching” as I/O concurrency. If you were referring to the batch APIs that some providers offer: Semlib does not use those. They are too slow for the kind of data processing I wanted to do / not great when you have a lot of data dependencies. For example, a semantic Quicksort would take forever if each batch is processed in 24 hours (the upper bound when using Anthropic’s batch APIs, for example).


That was a small self-contained example that fit above the fold in the README (and fwiw even last year’s models like GPT-4o give the right output there). That `sort` is based on pairwise comparisons, which is one of the best ways you can do it in terms of accuracy (Qin et al., 2023: https://arxiv.org/abs/2306.17563).

I think there are many real use cases where you might want a semantic sort / semantic data processing in general, when there isn’t a deterministic way to do the task and there is not necessarily a single right answer, and some amount of error (due to LLMs being imperfect) is tolerable. See https://semlib.anish.io/examples/arxiv-recommendations/ for one concrete example. In my opinion, the outputs are pretty high quality, to the point where this is practically usable.

These primitives can be _composed_, and that's where this approach really shines. As a case study, I tried automating a part of performance reviews at my company, and the Semlib+LLM approach did _better_ than me (don't worry, I didn't dump AI-generated outputs on people; I first did the work manually, then shared both versions with an explanation of where each version came from). See the case study in https://anishathalye.com/semlib/

There’s also some related academic work in this area that also talks about applications. One of the most compelling IMO is DocETL’s collaboration to analyze police records (https://arxiv.org/abs/2410.12189). Some others you might enjoy checking out are LOTUS (https://arxiv.org/abs/2407.11418v1), Palimpzest (https://arxiv.org/abs/2405.14696), and Aryn (https://arxiv.org/abs/2409.00847).


As you compose fuzzy operations, your errors multiply! Nobody is asking for perfection, but this tool seems to me a straightforward way to launder bad data. If you want to do a quick check of an idea then it's probably great, but if you're going to be rigorous and use hard data and reproducible, understandable methods then I don't think it offers anything. The plea for citations at the end of the readme also rubs me the wrong way.


I think semantic data processing in this style has a nonempty set of use cases (e.g., I find the fuzzy sorting of arXiv papers to be useful, I find the examples in the docs representative of some real-world tasks where this style of data processing makes sense, and I find many of the motivating examples and use cases in the academic work compelling). At the same time, I think there are many tasks for which this approach is not the right one to use.

Sorry you didn't like the wording in the README; that was not the intention. I like to give people a canonical form they can copy-paste if they want to cite the work: citations have been a mess for many of my other GitHub repos, which makes it hard to find out who is using the work (which can be really informative for improving the software, and I often follow up with authors of papers via email, etc.). For example, I heard about Amazon MemoryDB because they use Porcupine (https://dl.acm.org/doi/pdf/10.1145/3626246.3653380). I appreciate you sharing your feelings; I stripped the text from the README. If you have additional suggestions, I'd appreciate your comments or a PR.


FWIW it doesn't serve as a great example because the ordering is not obvious. I think that is what GP was reacting to. When I say "sort a list of presidents by how right-leaning they are" in any other context, people would probably assume the MOST right-leaning president to be listed first. It took me a moment to remember that Python's `sort` is in ascending order by default.


Good point, I see how the example can be confusing. Updated the example to have `reverse=True` and a comment, hopefully that clarifies things.
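
For reference, the updated snippet looks roughly like this (simplified and paraphrased; parameter names like `by=` may not match the actual API exactly, so see the README for the real code):

    from semlib import Session  # assuming this import path

    async def main():
        session = Session(max_concurrency=10)
        ranked = await session.sort(
            ["George Washington", "Abraham Lincoln", "Ronald Reagan"],
            by="how right-leaning the president's politics were",
            reverse=True,  # most right-leaning first
        )
        print(ranked)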


Thank you for engaging with me so politely and constructively. I care probably more than I should about good science and honesty in academia, and so I feel compelled to push back in cases where I see things like: blatant overstating of capabilities, artificially farming citations.

This case seems to have been a false positive. Surely people will misuse your tool, but that's not your responsibility, as long as you haven't misled them to begin with. Good luck with the project; I hope to someday need to cite the software myself.


For sure! I share your feelings about good science and honesty in academia :)


Hi HN!

I've been thinking a lot about semantic data processing recently. A lot of the attention in AI has been on agents and chatbots (e.g., Claude Code or Claude Desktop), and I think semantic data processing is not well-served by such tools (or frameworks designed for implementing such tools, like LangChain).

As I was working on some concrete semantic data processing problems and writing a lot of Python code (to call LLMs in a for loop, for example, and then adding more and more code to do things like I/O concurrency and caching), I wanted to figure out how to disentangle data processing pipeline logic from LLM orchestration. Functional programming primitives (map, reduce, etc.), common in data processing systems like MapReduce/Flume/Spark, seemed like a natural fit, so I implemented semantic versions of these operators. It's been pretty effective for the data processing tasks I've been trying to do.
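
To give a flavor of the idea, here's a minimal sketch of a semantic `map` (this isn't Semlib's implementation; `call_llm` stands in for whatever client you use). The pipeline logic (map over items) stays separate from the orchestration concerns (bounding concurrency, and in a real system, caching and retries):

    import asyncio

    async def semantic_map(items, instruction, call_llm, max_concurrency=10):
        sem = asyncio.Semaphore(max_concurrency)

        async def one(item):
            # bounded I/O concurrency, handled in one place
            async with sem:
                return await call_llm(f"{instruction}\n\nInput: {item}")

        return await asyncio.gather(*(one(item) for item in items))

Once map/filter/sort/reduce handle the orchestration, the pipeline itself reads like ordinary data processing code.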

This blog post shares some more details on the story and elaborates on what I like about this approach to semantic data processing. It also covers some of the related work in this area (like DocETL from Berkeley's EPIC Data Lab, LOTUS from Stanford and Berkeley, and Palimpzest from MIT's Data Systems Group).

Like a lot of my past work, the software itself isn't all that fancy; but it might change the way you think!

The software is open-source at https://github.com/anishathalye/semlib. I'm very curious to hear the Hacker News community's thoughts!


Implementing an ACME client is part of the final lab assignment for MIT’s security class: https://css.csail.mit.edu/6.858/2023/labs/lab5.html


Nice, thanks! I've been wanting to learn it, as dealing with cert expirations every year is a pain. My guess is that we will have 24-hour certs at some point.


I don’t know about 24 hours, but it will be 47 days in 2029.


Looks like a good class; is it only available to enrolled students? The videos seem to be behind a login wall.


Looks like the 2023 lectures weren't uploaded to YouTube, but the lectures from earlier iterations of the class, including 2022, are available publicly. For example, see the YouTube links on https://css.csail.mit.edu/6.858/2022/

(6.858 is the old name of the class; it was renamed to 6.5660 recently.)


I see someone posted this before I was able to do it :)

Hi HN! For the last six years, I've been working on techniques to build high-assurance systems using formal verification, with a focus on eliminating side-channel leakage. I'm defending my PhD thesis next week, where I'll talk about our approach to verifying hardware security modules, with proofs covering the entire hardware and software system down to the wire-I/O level. In terms of the artifacts we verify: the biggest example is an ECDSA signature HSM, implemented in 2,300 lines of C code and 13,500 lines of Verilog, and we verify its behavior (capturing correctness, security, and non-leakage) against a succinct 50-line specification.

One of the components that I'm most excited about is how we formally define security for a system at the wire-I/O level: we do this with a new security definition called "information-preserving refinement," inspired by the real/ideal paradigm from theoretical cryptography.

HN has been a huge part of my life since I started undergrad about 10 years ago (I post occasionally but mostly read). I would love to see some of the HN community there, whether in person or over Zoom; PhD thesis defense talks are open to the public, and my talk is aimed at a general CS/systems audience!


This is really neat!

We've been working on some research to formally verify the hardware/software of such devices [1, 2]. Neat how there are so many shared ideas: we also use a PicoRV32, run on an iCE40 FPGA, use UART for communication to/from the PicoRV32 to keep the security-critical part of the hardware simple, and use a separate MCU to convert between USB and UART.

Interesting decision to make the device stateless. Given that the application keys are generated by combining the UDS, USS, and the hash of the application [3], it seems this rules out software updates? Was this an intentional tradeoff, to have a sort of "forward security"?

In an earlier project I worked on [4], we had run into a similar issue (no space for this in the write-up though); there, we ended up using the following approach: applications are _signed_ by the developer (who can use any keypair they generate), the signature is checked at application load time, and the application-specific key is derived using the hash of the developer's public key instead of the hash of the application. This does have the downside that if the developer is compromised, an adversary can use this to sign a malicious application that can leak the key.

[1]: https://github.com/anishathalye/knox-hsm
[2]: https://pdos.csail.mit.edu/papers/knox:osdi22.pdf
[3]: https://tillitis.se/blog/2023/03/31/on-tkey-key-generation/
[4]: https://pdos.csail.mit.edu/papers/notary:sosp19.pdf
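
To make the contrast concrete, here's a rough sketch of the two derivation schemes (the field ordering and the hash/KDF construction are simplified; this is not the real TKey or Notary code):

    import hashlib

    def h(*parts: bytes) -> bytes:
        return hashlib.blake2s(b"".join(parts)).digest()

    # TKey-style: the key is bound to the application's code, so updating the
    # application changes the derived key
    def app_key_measured(uds: bytes, uss: bytes, app_binary: bytes) -> bytes:
        return h(uds, h(app_binary), uss)

    # approach from [4]: the key is bound to the developer's public key (with
    # the application's signature checked at load time), so updates keep the
    # key, at the cost of trusting the developer's signing key
    def app_key_signed(uds: bytes, uss: bytes, developer_pubkey: bytes) -> bytes:
        return h(uds, h(developer_pubkey), uss)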


Thank you. Interesting paper!

As you've already noted, the TKey's KDF is Hash(UDS, Hash(TKey device app), USS), which means every device+application combination gets its own unique key material. As you conclude, this means an update to the loaded application changes the key material, which changes any public key the application might derive. This is a hassle and not very user-friendly.

However, nothing prevents the loaded application (A1) from loading another application (A2) in turn. This is a key feature, as it allows A1 to define a verified boot policy of your choice. The immutable firmware would do the KDF using A1's machine code. A1, once running, accepts a public key, a digital signature, and A2 as arguments. A1 measures the public key as context, verifies the digital signature, and then hands off its own contextualized key material to A2. In this example, A1 is doing verified boot using some policy, and A2 is the application the end user uses for authentication: FIDO2, TOTP, GPG, etc.
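
In rough pseudocode (simplified stand-ins, not the actual firmware/app code):

    import hashlib

    def h(*parts: bytes) -> bytes:
        return hashlib.blake2s(b"".join(parts)).digest()

    def firmware_load_a1(uds: bytes, uss: bytes, a1_binary: bytes) -> bytes:
        # immutable firmware: A1's key material is bound to A1's own code
        return h(uds, h(a1_binary), uss)

    def a1_load_a2(a1_key: bytes, developer_pubkey: bytes, signature: bytes,
                   a2_binary: bytes, verify) -> bytes:
        # A1's verified boot policy: check the developer's signature over A2,
        # then hand off key material contextualized by the developer's pubkey
        if not verify(developer_pubkey, signature, a2_binary):
            raise ValueError("signature verification failed; refusing to load A2")
        return h(a1_key, h(developer_pubkey))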

Regarding compromise of the developer's key, you might want to look into transparency logs. Another project I'm a co-designer of is Sigsum - a transparency log with distributed trust assumptions. We recently tagged it v1, and it should be small enough to fit into a TKey application. We haven't done it yet, though. Too many other things to do. :)


Very cool! That's a nice design that gives the developer the choice on the trade-off between being upgradeable and being future-proof against developer key compromise.

Transparency logs indeed are a neat ingredient to use here. I've heard of other software distributors (e.g., Firefox) using binary transparency logs but hadn't heard of anyone using them in the context of HSMs/security tokens/cryptocurrency wallets yet.


Thank you! We think so too. It is inspired by TCG DICE, which came out of Microsoft Research if I recall correctly. This approach has several other benefits as well (ownership transfer, etc.), which I've outlined in another comment in this thread.

Here's a cool application we've yet to make: Instead of only using the transparency log verification for the verified boot stage, use it in the signing stage as well - imagine a USB authenticator that only signs your software release if the hash to be signed is already discoverable in a transparency log. You could also rely on cosigning witnesses for secure time with distributed trust assumptions, and create policies like "only sign stuff if the current time is Monday-Friday between 09-17". That would require a challenge-response with the log though.

Regarding binary transparency, I think Mozilla only considered doing it but never actually did. In part this was probably because CAs and CT log operators didn't want CT to be used for BT as well. Speaking of transparency, you might be interested in another project I'm involved with, System Transparency, which aims to make the reachable state space of a remote running system discoverable.


Sharing some context here: in grad school, I spent months writing custom data analysis code and training ML models to find errors in large-scale datasets like ImageNet, work that eventually resulted in this paper (https://arxiv.org/abs/2103.14749) and demo (https://labelerrors.com/).

Since then, I’ve been interested in building tools to automate this sort of analysis. We’ve finally gotten to the point where a web app can do automatically in a couple of hours what I spent months doing in Jupyter notebooks back in 2019—2020. It was really neat to see the software we built automatically produce the same figures and tables that are in our papers.

The blog post shared here is results-focused, talking about some of the data and dataset-level issues that a tool using data-centric AI algorithms can automatically find in ImageNet, which we used as a case study. Happy to answer any questions about the post or data-centric AI in general here!

P.S. all of our core algorithms are open-source, in case any of you are interested in checking out the code: https://github.com/cleanlab/cleanlab
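
For example, finding likely label errors takes just the given labels plus any model's out-of-sample predicted probabilities; with the current cleanlab API it looks something like this (toy data; check the docs for the options available in your version):

    import numpy as np
    from cleanlab.filter import find_label_issues

    # 4 examples, 2 classes; the last example's label disagrees with the model
    labels = np.array([0, 0, 1, 1])
    pred_probs = np.array([
        [0.9, 0.1],
        [0.8, 0.2],
        [0.1, 0.9],
        [0.9, 0.1],  # confidently class 0, but labeled 1
    ])

    issue_indices = find_label_issues(
        labels=labels,
        pred_probs=pred_probs,
        return_indices_ranked_by="self_confidence",
    )
    print(issue_indices)  # expect this to flag example 3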


The class website has a list of resources that includes open-source DCAI tools: https://dcai.csail.mit.edu/resources/


Hi HN! I’m back with another “what they don’t teach you in school” style course that I’d love to share with the community. (A couple years ago, I was part of the team that taught Missing Semester, an IAP class that taught programmer tools that weren’t covered in any CS courses at MIT: https://news.ycombinator.com/item?id=22226380.)

MIT, like most universities, has many courses on machine learning (6.036, 6.867, and many others). Those classes teach techniques to produce effective models for a given dataset, and the classes focus heavily on the mathematical details of models rather than practical applications. However, in real-world applications of ML, the dataset is not fixed, and focusing on improving the data often gives better results than improving the model. We’ve personally seen this time and time again in our applied ML work as well as our research.

Data-Centric AI (DCAI) is an emerging science that studies techniques to improve datasets in a systematic/algorithmic way — given that this topic wasn’t covered in the standard curriculum, we (a group of PhD candidates and grads) thought that we should put together a new class! We taught this intensive 2-week course in January over MIT’s IAP term, and we’ve just published all the course material, including lecture videos, lecture notes, hands-on lab assignments, and lab solutions, in hopes that people outside the MIT community would find these resources useful.

We’d be happy to answer any questions related to the class or DCAI in general, and we’d love to hear any feedback on how we can improve the course material. Introduction to Data-Centric AI is open-source opencourseware, so feel free to make improvements directly: https://github.com/dcai-course/dcai-course.


Many real-world datasets use multiple annotations per example to ensure higher-quality labels. CROWDLAB is a new set of algorithms that estimate 3 key quantities better than prior standard crowdsourcing algorithms like GLAD and Dawid-Skene: (1) a consensus label per example, (2) a confidence score for the correctness of the consensus label, and (3) a rating for each annotator.

The blog post gives some intuition for how it works, along with some benchmarking results, and the math and the nitty-gritty details can be found in this paper: https://cleanlab.github.io/multiannotator-benchmarks/paper.p...

Happy to answer any questions related to multi-annotator datasets or data-centric approaches to ML in general here.
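
If you want to try it on your own data, the implementation is in cleanlab, and usage looks roughly like this (toy data below; see the docs for the exact argument and return names):

    import numpy as np
    import pandas as pd
    from cleanlab.multiannotator import get_label_quality_multiannotator

    # one row per example, one column per annotator
    # (NaN where an annotator didn't label that example)
    labels_multiannotator = pd.DataFrame({
        "annotator_1": [0, 0, 1, np.nan],
        "annotator_2": [0, 1, 1, 1],
        "annotator_3": [0, np.nan, 1, 0],
    })
    # out-of-sample predicted class probabilities from any model
    pred_probs = np.array([
        [0.9, 0.1],
        [0.6, 0.4],
        [0.2, 0.8],
        [0.5, 0.5],
    ])

    results = get_label_quality_multiannotator(labels_multiannotator, pred_probs)
    # results include per-example consensus labels and quality scores,
    # plus per-annotator quality ratings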

