bhyolken's comments

bhyolken · on Nov 17, 2023

Author here. This decision was more about ease of implementation than anything else. Our internal application logs were already being scooped up by GCP because we run our services in GKE, and we already had a GCP->Datadog log syncer [1] for some other GCP infra logs, so re-using the GCP-based pipeline was the easiest way to handle our application logs once we removed the Datadog agent.

In the future, we'll probably switch these logs to also go through our collector, and it shouldn't be super hard (because we already implemented a golang OTel log handler for the external case), but we just haven't gotten around to it yet.

[1] https://docs.datadoghq.com/integrations/google_cloud_platfor...

bhyolken · on Nov 16, 2023

Sure, happy to provide more specifics!

Our main issue was the lack of a synchronous gauge. The officially supported asynchronous API of registering a callback function to report a gauge metric is very different from how we were doing things before, and would have required lots of refactoring of our code. Instead, we wrote a wrapper that exposes a synchronous-like API: https://gist.github.com/yolken-airplane/027867b753840f7d15d6....

It seems like this is a common feature request across many of the SDKs, and it's in the process of being fixed in some of them (https://github.com/open-telemetry/opentelemetry-specificatio...)? I'm not sure what the plans are for the golang SDK specifically.

Another, more minor issue, is the lack of support for "constant" attributes that are applied to all observations of a metric. We use these to identify the app, among other use cases, so we added wrappers around the various "Add", "Record", "Observe", etc. calls that automatically add these. (It's totally possible that this is supported and I missed it, in which case please let me know.)

Overall, the SDK was generally well-written and well-documented, we just needed some extra work to make the interfaces more similar to the ones we were using before.

roskilli · on Nov 17, 2023

Thanks for the detailed response.

I am surprised there is no gauge update API yet (instead of callback only), this is a common use case and I don't think folks should be expected to implement their own. Especially since it will lead to potentially allocation heavy bespoke implementations, depending on use case given mutex+callback+other structures that likely need to be heap allocated (vs a simple int64 wrapper with atomic update/load APIs).

Also I would just say that the fact the APIs differ a lot to more common popular Prometheus client libraries does beg the question of do we need more complicated APIs that folks have a harder time using. Now is the time to modernize these before everyone is instrumented with some generation of a client library that would need to change/evolve. The whole idea of an OTel SDK is instrument once and then avoid needing to re-instrument again when making changes to your observability pipeline and where it's pointed. This becomes a hard sell if OTel SDK needs to shift fairly significantly to support more popular & common use cases with more typical APIs and by doing so leaves a whole bunch of OTel instrumented code that needs to be modernized to a different looking API.

arccy · on Nov 16, 2023

the official SDKs will only support an api once there's a spec that allows it.

for const attributes, generally these should be defined at the resource / provider level: https://pkg.go.dev/go.opentelemetry.io/otel/sdk/metric#WithR...

bhyolken · on Nov 16, 2023

Thanks! For logs, we actually use github.com/segmentio/events and just implemented a handler for that library that batches logs and periodically flushes them out to our collector using the underlying protocol buffer interface. We plan on migrating to log/slog soon, and once we do that we'll adapt our handler and can share the code.

caust1c · on Nov 16, 2023

Awesome! Great work and thanks for sharing your experience!

bhyolken · on Nov 16, 2023

Author here, thanks for the question! The current split developed from the personal preferences of the engineers who initially set up our observability systems, based on what they had used (and liked) at previous jobs.

We're definitely open to doing more consolidation in the future, especially if we can save money by doing that, but from a usability standpoint we've been pretty happy with Honeycomb for traces and Datadog for everything else so far. And, that seems to be aligned with what each vendor is best at at the moment.

MuffinFlavored · on Nov 16, 2023

> from the personal preferences of the engineers

https://www.honeycomb.io/pricing

https://www.datadoghq.com/pricing/

Am I wrong to say... having 2 is "expensive"? Maybe not if 50% of your stuff is going to Honeycomb and 50% going to DataDog. Could you save money/complexity (less places to look for things) having just DataDog or just Honeycomb?

bhyolken · on Nov 16, 2023

Right now, there isn't much duplication of what we're sending to each vendor, so I don't think we'd save a ton by consolidating, at least based on list prices. We could maybe negotiate better prices based on higher volumes, but I'm not sure if Airplane is spending enough at this point to get massive discounts there.

Another potential benefit would definitely be reduced complexity and better integration for the engineering team. So, for instance, you could look at a log and then more easily navigate to the UI for the associated trace. Currently, we do this by putting Honeycomb URLs in our Datadog log events, which works but isn't quite as seamless. But, given that our team is pretty small at this point and that we're not spending a ton of our time on performance optimizations, we don't feel an urgent need to consolidate (yet).

MuffinFlavored · on Nov 16, 2023

When you say DataDog for everything else (as in not traces), besides logs, what else do you mean?

claytonjy · on Nov 16, 2023

Metrics, probably? The article calls out logs, metrics, and traces as the 3 pillars of observability.

bhyolken · on Nov 16, 2023

Yeah, metrics and logs, plus a few other things that depend on these (alerts, SLOs, metric-based dashboards, etc.).

bhyolken · on April 11, 2023

I lived on that block for close to 10 years and can confirm that the safety situation around there significantly degraded over the last 1-2 years. Gangs of drug dealers moved in and took over key sidewalk areas (particularly 7th and 8th streets between Market and Mission), and violent crime and property destruction increased significantly. In my last year, there were two murders nearby (including one that I heard the gunshots for), and I witnessed a bunch of things I had never seen in the neighborhood before including trash fires, people screaming at all hours of the night, and overdose victims being revived on the sidewalk.

Although nothing bad happened to me personally, it just got really depressing and emotionally draining to live in a place where you're surrounded by so much suffering and destruction, and where you have to be hyper-vigilant every time you step outside. I packed up my things and left in December. My mental health and happiness have improved since then.

I don't blame Whole Foods and other businesses in the neighborhood for shutting down. The city leadership, including the mayor and a majority of the Board of Supervisors (SF's legislative body) really didn't seem to care one bit about the problems in that area of the city. Maybe if enough people and businesses vote with their feet, they'll be motivated to actually fix things.

bhyolken · on Jan 5, 2021

Author here. Just a few clarifications:

1. I have a ton of respect for people who do robotics work. I was trying to be a little humorous/cheeky in my descriptions here. Apologies if it comes across as flippant, that was not what I intended.

2. My undergrad was in EECS, so I know a little about the hardware side of the world (although, to be fair, I've never done it for work).

3. There's a bit more to the story than what I wrote about in my post. For reasons around confidentiality, etc. I had to focus on the things that were safe to talk about openly, some of which I agree are kind of petty. Ditto for the reasons about leaving Stripe.

bhyolken · on Aug 25, 2020

Author here, thanks for the comment. topicctl is really motivated by a desire to support rigorous, git-based topic management. The read-only views (tailing, repl, etc.) are secondary to that and are useful for our command-line-based workflows inside Segment, but are definitely not intended to replace all of the other good tooling out there, including the ones you've referenced.

bhyolken · on Aug 25, 2020

Not yet! But would be happy to look into those and/or get help adding them.