More

galgia · 2025-02-16T17:40:32 1739727632

Every time I start a new project I have to collect the data and guide clients through the first few weeks before I get some decent results to show them. This is why I created a libary to quickly build classic data science pipelines with LLMs, which you can use to demo any pipeline and even use it in production for non-critical use cases.

galgia · 2025-02-09T13:55:57 1739109357

Good point! LLMs are best when you are starting from point 0.

galgia · 2025-02-09T13:22:11 1739107331

Yes, LLMs are not always the best option, they are an option. Sometimes requirements of the project are such that they are also the best option.

There is one browser that uses price matching example that is impossible to do without a full-blown data science team right now: https://github.com/Pravko-Solutions/FlashLearn/tree/main/exa...

timr · 2025-02-09T13:27:14 1739107634

Inappropriate tools are always an option? I can cut a cake with a jackhammer, but....

Anyway, like I said, there are certainly good applications of LLMs, and this is probably one? I wouldn't describe "do market research on prices" as a traditional "data pipeline", but that's just me, I guess.

daxfohl · 2025-02-09T22:27:20 1739140040

I think you'd tell the LLM to design the pipeline, not be the pipeline. That way you can see exactly what it's done and tweak as needed. Plus should be way more cost effective.

galgia · 2025-02-09T13:11:33 1739106693

+ I assumed that most people will ctrl+a -> ctrl+c -> ChatGPT -> ctrl+v

tsumnia · 2025-02-09T18:55:46 1739127346

I will admit over reliance on AI is a major issue that we're coming to terms with right now. However to invoke playing devil's advocate, a person over relying on stimulants can also be a bad thing.

In moderation, AI can be fine and help. If you're assuming AI gets to do all the work while you sit around sipping mai tais and eating bonbons, you're going to have a rough time - which is exactly what we're starting to see with students that have been Copilot and GPTing through their classes. They're finally hitting the more complex stuff that needs creative thinking and problem solving skills that just aren't trained yet.

galgia · 2025-02-09T13:00:49 1739106049

I belive that LLMs will become better and better in the near future and pipelines will replace classic approaches with LLM-enriched pipelines will drastically simplify the ETL flows.

isaacremuant · 2025-02-09T14:35:10 1739111710

Not that I don't love LLMs and play with them and their potential but if we don't get proper mechanism that ensure quality and consistency then it's not really a substitute for what we have.

It's very easy to produce something that seemingly works but you can't attest to its quality. The problem is producing something resilient, that is easy to adapt and describes the domain of what you want to do.

If all these things are so great, them why do I still need to do so many things to integrate a bigtech cloud agent with popular tool? Why is it so costly or limited?

UX matters, validation matters, reliability matters, cost matters.

You can't simply wish for a problem not to happen. Someone owns the troubleshooting and the modification and they need to understand the system they're trying to modify.

Replacing scrapers with LLM is an easy and obvious thing, specially when you don't care about quality to a high degree. Other systems such as financial ones don't have that luxury.

benrutter · 2025-02-09T13:54:01 1739109241

You may be right! I guess we'll find out soon.

One thing I'd be wary of is what "LLM-enriched pipelines" look like. If it's "write a sentence and get a pipeline" then I think that does massively simplify the ammount of work, but there's another reality where people use LLMs to get more features out of existing data, rather than doing the same transformations we do now. Under that one, ETL pipelines would end up taking more time, and being more complex.

Yoric · 2025-02-09T13:37:44 1739108264

But at what cost?

We're in an energy/environmental crisis, and we're replacing simple pipelines with (unreliable) gas factories?

danielbln · 2025-02-09T15:56:51 1739116611

Cost per token has cratered a thousand percent over the last two years, and that's not just lighting VC on fire, efficiency gains are made left and right.

Yoric · 2025-02-09T19:14:18 1739128458

How much do we need to progress before it becomes comparable in terms of energy to the (often already rather energy-inefficient) data pipelines we've been using so far?

Recall that while the cost per token may decrease, CoT multiplies the number of tokens by several orders of magnitude.

galgia · 2025-02-09T13:49:20 1739108960

LLMs are not the most efficient way to solve the problem, but they can solve it.

Yoric · 2025-02-11T13:21:56 1739280116

They can do it, they're just slower, less reliable and orders of magnitude more energy-expensive.

But yes, they're potentially easier to setup.

galgia · 2025-02-09T12:58:40 1739105920

If your problem is compute, you are already optimizing. This is here for all the steps before you start thinking latency-compute. Not all use cases are made equal.

galgia · 2025-02-09T12:56:22 1739105782

I see it as a gray area - long term there will be a need for both and you will have just one tool to choose from when presented with time-budget-quality constraints.

miningape · 2025-02-09T13:01:40 1739106100

Yeah I can also see it very much depending on the demands - I'm definitely not saying every pipeline has to be the most reliable, scalable piece of software ever written.

If a small script works for you and your use case / constraints there's nothing I can say against it, but when you do grow past a certain point you'll need pipelines built in a proper way. This is where I see the increased demand since the scrappy pipelines are already proving their value.

galgia · 2025-02-09T13:48:35 1739108915

Exactly, scale after you need to.

galgia · 2025-02-09T12:54:43 1739105683

You are right! This is here to be used when your resources do not allow you to build full-blow solutions. Yes, I used LLMs to help create examples from my existing code, but they are based on things I have put in production when the client's resources were limited and wanted to move from point 0 to test out the potential of LLMs on their data.

galgia · 2025-02-09T12:29:59 1739104199

Exactly!

galgia · 2025-02-04T20:41:15 1738701675

Thank you! It took a while :)