
I've always expected some of the work in this field (and biotech) to run experiments in parallel by using arrays of microfluidic reaction systems. The products could then be transported to other systems to measure the properties or evaluate results. Something like this would benefit from being driven by ML.
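
A Bayesian-optimization loop is the usual framing for the ML part of such a setup. Here is a minimal sketch of the idea, assuming a scikit-learn Gaussian process as the surrogate model and a made-up measure_yield() standing in for the microfluidic array and its downstream analytics:

  # Toy closed-loop optimizer: pick conditions, "run" them on a simulated
  # reactor array, refit the surrogate, repeat. Everything below is
  # illustrative; measure_yield() is a stand-in for the real instrument.
  import numpy as np
  from sklearn.gaussian_process import GaussianProcessRegressor

  rng = np.random.default_rng(0)

  def measure_yield(temp_c, residence_s):
      # Pretend to dispatch a condition to a reactor and read back a yield.
      return np.exp(-((temp_c - 80) / 30) ** 2) * np.log1p(residence_s) \
          + rng.normal(0, 0.02)

  # Candidate grid of conditions the planner may choose from.
  temps = np.linspace(20, 140, 25)
  times = np.linspace(10, 300, 25)
  candidates = np.array([(t, s) for t in temps for s in times])

  # Seed with a handful of random experiments, then iterate.
  X = candidates[rng.choice(len(candidates), 5, replace=False)]
  y = np.array([measure_yield(*x) for x in X])

  for _ in range(10):
      gp = GaussianProcessRegressor(normalize_y=True, alpha=1e-3).fit(X, y)
      mu, sigma = gp.predict(candidates, return_std=True)
      nxt = candidates[np.argmax(mu + 1.5 * sigma)]   # upper confidence bound
      X = np.vstack([X, nxt])
      y = np.append(y, measure_yield(*nxt))

  print("best conditions so far:", X[np.argmax(y)], "yield:", round(y.max(), 3))

In practice most of the difficulty hides inside measure_yield(): sample transport, cleanup, and the analytics, which other commenters get into below.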


This has already been done for years... Unironically, a lot of it has also been using ML for decades. I know people in boring industries like dairy or paint who maintain 30-year-old ML models that work excellently.

The difference is that their ML often operates in regulated environments, because unlike in advertising, people can die from mistakes. Also, the data isn't cheap; you can't go on Google and just download a terabyte of it. Not because scientists are sneaky, but because some experiments require five million dollars of equipment and months to acquire. And then the findings rarely map onto anything else in any way, shape, or form.

Statistical models in some fields have to be approved by a government agency or follow standard practices. This can take years and cost a lot of money.


Yes, GLMs, agents, and Bayesian models have been used in process chemistry, food chemistry, and pharma for quite a while.


So this was Synthego’s OG thesis, but it didn’t validate in the market.

In the last 5 years, the industry has moved to using the LabCyte Echo with high-well-count plates for this kinda work. Zymergen (RIP), Amyris, and Ginkgo have scaled this up to something that resembles a model train layout, where plates are shuffled between discrete workcells by little trains.

One of the challenges is the sheer volume of data: Illumina sequencers generate multi-TB files for analysis (in the synthetic-biology context), and most folks don't have "fast datacenter networks," so overwhelmingly I see folks buying Snowballs, using AWS Direct Connect, or running on-prem.

Industry is broadly interested in this kinda thing, with efforts like [1] [2] (me) and many, many others integrating into the Design-Build-Test pipeline. Commercial MD (not necessarily only protein folding) has had a huge boost from NNs as well, with companies like [3] [4] cropping up to sell their analysis as a service.

Academia has also not been sitting idle, with labs like [5] [6] doing cool stuff.

Pure, classic microfluidic setups are a huge PITA, but technologies like the Echo or [7] have the potential to change some of the unit economics.

[1] https://atomscience.org/

[2] https://radix.bio/

[3] https://deepcure.ai/

[4] https://syntensor.com/

[5] https://www.damplab.org/

[6] https://www.chem.gla.ac.uk/cronin/

[7] https://www.voltalabs.com/


I suspect this is the case, at least in some areas. Robotics has been used, along with other smaller-scale/faster systems, to screen corporate databases of O(million) compounds for activity vs. new targets. They've also had chip-based, multi-sensor setups for decades. Given the sheer amount of $$ floating about this business, if they can buy it, they have. As soon as it was available. Compared to the cost of failure, the cost of hardware or software is small.

And the chemical modeling researchers were playing with machine learning/neural nets in the previous century (Gasteiger, amongst others). The problem then, as now, was that the number of statistical methods to build models greatly exceeded the amount of data that was available. And even companies that have grown by acquisition (Pfizer, for example) didn't get clean data they could aggregate - and much of it was on paper.


This has been quite the hot topic in chemistry and materials for the last few years. See https://arxiv.org/abs/2304.11120 for a current, perspective-ish review.


I imagine a similar approach could be taken re: genomics/proteomics: thousands of tiny bioreactors with slightly different genetics, temperatures, nutrients, etc., all piped to chromatography equipment and optimizing for the metabolic pathway of some desirable product. Maybe blast 'em with gamma and try to catch a lucky mutation, etc.

Edit: I'm not the only one imagining such a thing: https://www.sciencedirect.com/science/article/pii/S095816692...
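
In that spirit, here is a toy mutate-and-screen loop, where the "genome" is just a parameter vector and measure_titer() is a made-up stand-in for the chromatography readout (everything here is illustrative, not any particular platform):

  # Toy directed-evolution-style screen: mutate, "grow", measure, keep the best.
  import numpy as np

  rng = np.random.default_rng(1)
  N_REACTORS = 96        # bioreactors screened per round
  KEEP = 8               # parents carried into the next round

  def measure_titer(genome):
      # Stand-in for integrating a product peak off the chromatograph.
      return -np.sum((genome - 0.7) ** 2) + rng.normal(0, 0.05)

  parents = rng.random((KEEP, 10))             # initial strains
  for _ in range(20):
      # "Irradiate": each child is a parent plus random mutations.
      idx = rng.integers(0, KEEP, N_REACTORS)
      children = parents[idx] + rng.normal(0, 0.05, (N_REACTORS, 10))
      titers = np.array([measure_titer(g) for g in children])
      parents = children[np.argsort(titers)[-KEEP:]]   # select the best

  print("best titer after screening:", round(float(titers.max()), 3))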


For a recently published example of this, see [1]: an automated platform called Self-driving Autonomous Machines for Protein Landscape Exploration (SAMPLE) can design and build proteins using AI agents and robotics. In an initial proof of concept, it was used to make glycoside hydrolase (sugar-cutting) enzymes that can withstand higher-than-normal temperatures.

The SAMPLE system used four autonomous agents, each of which designed slightly different proteins. The agents searched a protein's fitness landscape and then tested and refined their designs over 20 cycles; the entire process took just under six months. It took one hour to assemble the genes for each protein, one hour to run PCR, three hours to express the proteins in a cell-free system, and three hours to measure each protein's heat tolerance. That's eight hours per data point! The agents had access to a microplate reader and a Tecan automation system, and some work was also done at the Strateos Cloud Lab.

SAMPLE made sugar-cutting enzymes that could tolerate temperatures 10°C higher than even the best natural sequence, called Bgl3. The AI agents weren't "told" to enhance catalytic efficiency, but their designs also had catalytic efficiencies that matched or exceeded Bgl3's.
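
For a sense of throughput, the four quoted steps add up as follows (a trivial back-of-the-envelope check; the step labels are mine, only the hours come from the text above):

  # Back-of-the-envelope timing for one design-build-test data point.
  step_hours = {
      "assemble genes": 1,
      "PCR": 1,
      "cell-free expression": 3,
      "thermostability assay": 3,
  }
  per_point = sum(step_hours.values())
  print(f"wet-lab hours per data point: {per_point}")   # -> 8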

[1] https://www.biorxiv.org/content/10.1101/2023.05.20.541582v1

[2] https://www.readcodon.com/i/122504181/ai-agents-design-prote...


I recently started taking biology classes, the idea being that I might like to work with systems like this (writing code that solves code problems that are tenuously linked to real problems is not going to be satisfying forever).

I'm taking bioinformatics next semester, which I hope will give me the lay of the land from a code perspective, but I really don't know what I'm getting into here.

Any advice?


Yes, that already exists. High-throughput screening of enzymes, sequences, or reactions is common.


The problem is that microfluidic devices are not a panacea: they usually behave really badly with some solvents or reagents, and with polymer work in particular they tend to clog. The transport system has the same issue; you need to resuspend, evaporate, quench, or apply other treatments, and that's hard to automate, especially for viscous or hard samples.

ML could assist with defining the conditions and eventually with interpreting the analytical data, but not at all with the physical processing, which is where the difficulty really lies.


I read about a German battery research facility doing exactly that, but I can't recall the name.



