Airbnb wrote a few great blog posts about why they built a standardized metrics layer - I think this one gives better technical context on the "what" and "why" than this announcement does:
The interesting bit for me about Minerva (predecessor to Transform) was the engineering rigor the pipeline allowed. It's pretty easy to write a query that brings a DB to its knees, so source-controlled and reviewed intermediary steps for producing metrics seemed like crucial features. If you then have a good catalog of what metrics exist, you can let BI tools or business types with their own tools mix and match performant metrics to their heart's delight.
I'm curious what this enables that can't be done with a thoughtfully designed data warehouse. Looking at the "With MetricFlow" and "Without MetricFlow" examples, it seems like it's designed for people who are willing to use a new query language, as long as it's not SQL?
On the one hand, the compile targets for MetricFlow and DBT are SQL. So strictly speaking, nothing can be done with these that can't be done with a data warehouse.
SQL doesn't really have a good way to define new aggregations. Take something like Daily Active Users / Monthly Active Users (DAU/MAU), a semi-complicated calculation over windowed data: there's no good way to define a reusable DAU/MAU aggregation concept you could apply at different levels without rewriting the SQL that defines the aggregation. E.g. DAU/MAU against the account, DAU/MAU by user type, or DAU/MAU by region would all be different queries/reports.
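A rough sketch of that repetition problem, using SQLite and a toy events table (the table name, columns, test dates, and the trailing-30-day MAU window are all my own assumptions, not anything from MetricFlow):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INT, day TEXT, region TEXT, user_type TEXT)")
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (1, "2024-01-30", "us", "free"),
        (2, "2024-01-15", "us", "paid"),
        (3, "2024-01-30", "eu", "paid"),
    ],
)

# The aggregation logic (DAU on day :d over trailing-30-day MAU) has to be
# restated for every grouping level -- only the GROUP BY dimension changes.
DAU_MAU = """
SELECT {dim},
       CAST(COUNT(DISTINCT CASE WHEN day = :d THEN user_id END) AS REAL)
       / COUNT(DISTINCT user_id) AS dau_mau
FROM events
WHERE day BETWEEN date(:d, '-29 days') AND :d
GROUP BY {dim}
"""

for dim in ("region", "user_type"):
    # without a metrics layer, each of these is effectively a separate report
    print(dim, con.execute(DAU_MAU.format(dim=dim), {"d": "2024-01-30"}).fetchall())
```

Here the template trick hides the duplication, but in practice each variant tends to live as its own hand-maintained query or dashboard, which is exactly where definitions drift apart.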
Instead, if you have a layer that can be given the definition of how a particular aggregation is structured, and that knows how the data is related, you can ask for that metric against different concepts or levels without repeating yourself.
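For concreteness, dbt's (now-legacy) metrics spec expressed this roughly as a YAML definition - the model, column, and dimension names below are my own invention, but the keys follow the dbt v1.3-era metrics schema:

```yaml
# metrics.yml -- sketch of a single metric definition
metrics:
  - name: active_users
    label: Active Users
    model: ref('events')
    calculation_method: count_distinct
    expression: user_id
    timestamp: day
    time_grains: [day, month]
    dimensions:
      - region
      - user_type
```

Asking for active_users by region, by user_type, or at month grain then becomes a parameter to the query-generation layer rather than a new hand-written query.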
Extending that out, if BI tools can understand that layer, then you can get consistent WYSIWYG report building across multiple tools without having to redefine the calculations everywhere.
In practice - complicated time-series metrics, especially on top of derived temporal-logic attributes like funnels, activation, etc., are phenomenally difficult for most analysts to write by hand in SQL. We are in the process of switching to dbt metrics and it cuts the effort for this kind of thing down 5x-10x, and the SQL code dbt generates runs significantly faster too.
Looking at the DBT semantic layer documentation, I think I get it. The goal seems to be describing a metric once so it can be aggregated at multiple levels, instead of defining all of the aggregations separately.
For example, if you want to be able to aggregate revenue by subscription plan and by country over year, month, or day, then you would have 6+ different aggregations. The more dimensions and time periods you add, the more the combinations blow out. (I think the formula is dimension combinations multiplied by the number of time periods.) Having a single definition of each metric prevents errors from creeping into rote re-implementations of it. A bit like generics, but for queries.
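The blow-up is easy to count. One way to frame the formula hedged above: every non-empty subset of dimensions, at every time grain (the dimension and grain names here are just the ones from the revenue example):

```python
from itertools import combinations

dimensions = ["plan", "country"]
time_grains = ["day", "month", "year"]

# every non-empty subset of dimensions, crossed with every time grain
groupings = [
    (subset, grain)
    for r in range(1, len(dimensions) + 1)
    for subset in combinations(dimensions, r)
    for grain in time_grains
]
# (2^2 - 1) * 3 = 9 distinct aggregations; the "6+" above counts only
# single-dimension groupings (2 * 3). Add a third dimension and it's 21.
print(len(groupings))
```

Each of those groupings is a separate query if written by hand, and a single metric definition if generated.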
At this point, I'm curious how they manage to make the integrations with different platforms work. (Ultimately, I think that's why they acquired Transform.)
I think that’s still the outstanding question. Tools like Mode and Hex have already started supporting dbt’s metrics layer, but will big players like Looker and Tableau also move to support it? That’s harder to say.
One of Transform's strong suits is their Tableau integration - they have really good tooling for pushing metrics out to Tableau, rather than relying on Tableau to pull them in. Apparently AirBnb was/is still a very dedicated Tableau shop.
Minerva actually had really poor Tableau support as of a year ago; that's something Transform improved on. Most Minerva consumption was done via Superset, Google Sheets, and email.
Seems like a cool place to work, but I'm always a little wary of venture-backed open source companies. When the board wants a liquidity event, they'll win out over users every time: put more features behind a paywall, or force everyone onto a SaaS platform.
Anyhoo, all pure speculation. While I've never used it, my friends have great things to say about DBT.
That’s right, they’re solving the same problem, although metric/semantic layers have traditionally lived within BI tools. Looker’s LookML is a great example of a metric layer.
There you define the relationships between tables, which dimensions to expose to your users, and how measures are to be aggregated along which dimensions and at what roll ups. Filters, formatting, and other semantic information rounds out what’s possible, making it easier to build complex reporting without having to pre-build large joined tables in the warehouse.
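A minimal LookML sketch of what's described above (the view, table, and field names are made up for illustration):

```lookml
view: orders {
  sql_table_name: analytics.orders ;;

  dimension: country {
    sql: ${TABLE}.country ;;
  }

  dimension_group: ordered {
    type: time
    timeframes: [date, month, year]
    sql: ${TABLE}.ordered_at ;;
  }

  measure: total_revenue {
    type: sum
    sql: ${TABLE}.amount ;;
  }
}
```

Anything built on this view can roll total_revenue up by country at date, month, or year grain without restating the aggregation - but only inside Looker, which is the decoupling argument below.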
In theory a semantic layer decoupled from BI can be useful too, as downstream systems can all query a metric. Maybe an operational workflow needs access to revenue numbers or a spreadsheet wants a feed of some fresh aggregates.
The real question is: will BI tools adopt a semantic layer they don’t own, giving up control over a key input into their visualizations? Time will tell.
Oracle BI has this separation in the RPD layer. So there is some history of BI tools separating out that layer, and the separation doesn't have to mean giving up control - it's more of an organizational decision. That said, it's probably not advisable to give the same users creating dashboards wide-open access to the semantic layer, as things can quickly get overcomplicated and hard to maintain.
Because when you have a DAG of 100s of CREATE statements that need to be run in dependency order, many of which are table materializations and not views, you will end up creating a poorly specified, half-baked version of DBT.
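Hand-rolling that is essentially re-implementing dependency ordering. A minimal sketch of the scheduling part of the problem, using Python's stdlib topological sorter (the model names and their ref() relationships are invented):

```python
from graphlib import TopologicalSorter

# each model maps to the set of models it selects from
# (in dbt, these edges come from ref() calls in the model SQL)
deps = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "daily_revenue": {"orders_enriched"},
}

# a valid order to run the CREATE statements in: staging models first,
# daily_revenue last
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Ordering is only the start - dbt also handles incremental materializations, failure/retry, partial runs, and testing on top of this graph, which is where the homegrown versions tend to stay half-baked.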
- Not just views, but tables and incremental tables
- Testing framework
I've worked on a few data teams that have all been "copy/paste SQL from GitHub into the Query Editor" and it's (obviously) pretty bad. DBT is super low-lift and professionalizes your data pipeline basically instantly.
Create view is a great V0.5 for a data warehouse and what I recommend people do if possible so they can concentrate on building the right schema, naming standards, etc.
dbt is the V1. You get a lot of tooling, including a proper DAG, logging, and parametrization. You also get the ability to easily materialize your tables in a convenient format, which matters once you figure out (probably when, not if) that consistency is important. Views can take you far, but most orgs will eventually need more, and dbt is designed to be exactly that.
As a side note, moving from views to dbt is actually quite easy. I've done it several times and it's usually taken a couple of developer days to get started and maybe a couple weeks to fully transition.
https://medium.com/airbnb-engineering/how-airbnb-achieved-me...
(As the article notes, Transform's founders are Airbnb vets who worked on Minerva)