
Real-time synthetic data generation with built-in connectors: https://github.com/glassflow/glassgen

Next step is to extend it into a server module so you can run it remotely.


- We used our custom ClickHouse sink, which inserts records in batches using the ClickHouse native protocol (as recommended by ClickHouse). Each insert is done in a single transaction, so if an insertion fails, partial records do not get inserted into ClickHouse.

- The way the system is architected, this cannot happen. If the deduplication server crashes, the entire pipeline stops and nothing is inserted. Currently, when we read data successfully from Kafka into our internal NATS JetStream, we acknowledge the new offset to Kafka; deduplication and insertion happen after that. The limitation today is that if our system crashes before inserting into ClickHouse (but after the ack to Kafka), we would not process that data. We are already working towards a solution for this.
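To make the batching concrete, here is roughly what a native-protocol batch insert looks like with the clickhouse-go v2 client. This is a simplified sketch, not our production sink; the events table, its columns, and the address are placeholders:

    package main

    import (
        "context"
        "log"
        "time"

        "github.com/ClickHouse/clickhouse-go/v2"
    )

    func main() {
        ctx := context.Background()

        // Connect over the native TCP protocol (port 9000), not HTTP.
        conn, err := clickhouse.Open(&clickhouse.Options{
            Addr: []string{"localhost:9000"},
        })
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // All appended rows are shipped as one insert block. ClickHouse
        // applies a single block atomically (per partition, within
        // block-size limits), so a failed Send leaves no partial rows.
        batch, err := conn.PrepareBatch(ctx, "INSERT INTO events (id, payload, ts)")
        if err != nil {
            log.Fatal(err)
        }
        for i := 0; i < 1000; i++ {
            if err := batch.Append(uint64(i), "example-payload", time.Now()); err != nil {
                log.Fatal(err)
            }
        }
        if err := batch.Send(); err != nil {
            log.Fatal(err)
        }
    }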


Right. I think this is fundamental, though. You can minimize the chance of duplicates but not avoid them completely under failure, given ClickHouse's guarantees. Note that transactions have certain limitations as well (re: partitions).

I'm curious who your customers are. I work for a large tech company and we use Kafka and ClickHouse in our stack but we would generally build things in house.


Thanks for your question. In GlassFlow, we use NATS JetStream to power deduplication (and its KV store for joins as well). I see from your blog post that Segment used RocksDB to power their deduplication pipeline. We actually considered RocksDB, but chose NATS JetStream because of the added complexity of scaling RocksDB (as it is embedded in the worker process). There is indeed a small network hop in our deduplication path, but our measured end-to-end latency is under 50 ms.
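For a sense of scale, the dedup check itself is essentially one KV operation. Here is a rough sketch with nats.go; the bucket name, TTL, and key scheme are illustrative, not our actual internals:

    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Drain()

        js, err := nc.JetStream()
        if err != nil {
            log.Fatal(err)
        }

        // One key per event ID; entries expire after the dedup window
        // so the bucket does not grow without bound.
        kv, err := js.CreateKeyValue(&nats.KeyValueConfig{
            Bucket: "dedup",
            TTL:    time.Hour,
        })
        if err != nil {
            log.Fatal(err)
        }

        seen := func(eventID string) bool {
            // Create fails if the key already exists, so checking and
            // marking an ID is a single atomic operation. (A real
            // implementation would distinguish "exists" from network
            // errors before dropping the event.)
            _, err := kv.Create(eventID, []byte{1})
            return err != nil
        }

        for _, id := range []string{"evt-1", "evt-2", "evt-1"} {
            if seen(id) {
                fmt.Println("duplicate, dropping:", id)
                continue
            }
            fmt.Println("first sighting, forwarding:", id)
        }
    }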


Thanks for clarifying, best of luck!


Yes, it's true that if you just want to send data from Kafka to ClickHouse and don't care about duplicates, there are several ways to do it. We even covered them in a blog post -> https://www.glassflow.dev/blog/part-1-kafka-to-clickhouse-da...

However, the reason we started building this is that duplication is a sad reality in streaming pipelines, and the methods for cleaning up duplicates in ClickHouse are not good enough (again covered extensively on our blog, with references to the ClickHouse docs).

The approach you mention for deduplication is 100% accurate. The goal in building this tool is to provide a drop-in node for your pipeline (just as you said), with optimised source and sink connectors for reliability and durability.


Thanks for taking a look!

1. The current implementation is just for ClickHouse, as we started with the segment of users building real-time analytics with ClickHouse in their stack. However, we learned along the way that streaming deduplication is a challenge for other destination databases as well. The architecture is designed so that we can extend the sinks and add additional destinations; we would just have to write the sink component specific to that database (see the interface sketch after point 3 below). Do you have a specific DB in mind that you would like to use?

2. Again, we started with Kafka because of our early target users, but the architecture inherently supports adding multiple sources. We already have experience building multiple source and sink connectors (from our previous project), so adding more would not be challenging. Which source do you have in mind?

3. Yes, running the tool locally in Docker on a MacBook Pro M2, it was able to handle 15k requests per second. We have built a load-testing infrastructure and are happy to share the code if you are interested.
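To be clear, the snippet below is not our load-testing infrastructure, just an illustrative kafka-go producer loop of the kind you could use to push events at the pipeline and measure message rate yourself (broker address and topic are placeholders):

    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        "github.com/segmentio/kafka-go"
    )

    func main() {
        w := &kafka.Writer{
            Addr:      kafka.TCP("localhost:9092"),
            Topic:     "events",
            BatchSize: 500, // batching dominates throughput far more than the loop itself
        }
        defer w.Close()

        const total = 100000
        msgs := make([]kafka.Message, 0, 500)
        start := time.Now()
        for i := 0; i < total; i++ {
            msgs = append(msgs, kafka.Message{
                Key:   []byte(fmt.Sprintf("evt-%d", i%50000)), // reuse keys to exercise dedup
                Value: []byte(`{"user":"u1","action":"click"}`),
            })
            if len(msgs) == cap(msgs) {
                if err := w.WriteMessages(context.Background(), msgs...); err != nil {
                    log.Fatal(err)
                }
                msgs = msgs[:0]
            }
        }
        elapsed := time.Since(start)
        fmt.Printf("produced %d msgs in %s (%.0f msg/s)\n", total, elapsed, float64(total)/elapsed.Seconds())
    }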
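And back to point 1: the sink contract we have in mind could look something like the interface below. The names are hypothetical, not from our codebase; the point is that adding a destination means writing one self-contained component:

    package sink

    import "context"

    // Record is one event flowing through the pipeline after deduplication.
    type Record struct {
        Key     string
        Payload []byte
    }

    // Sink is the contract a destination implements. The source and
    // dedup stages stay untouched when a new database is added.
    type Sink interface {
        // WriteBatch should be atomic: on error, none of the records
        // may be partially visible in the destination.
        WriteBatch(ctx context.Context, records []Record) error
        Close(ctx context.Context) error
    }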


Yes, it would easily be possible to configure the tool to stream directly from NATS and skip Kafka completely. The reason we started with a managed Kafka connector (via the NATS Kafka Bridge) is that most of the early users sending data to ClickHouse in real time already had Kafka in place.
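As a sketch of what skipping the bridge might look like (stream and subject names made up), producers would publish straight to a JetStream subject and the pipeline would consume from it:

    package main

    import (
        "log"
        "time"

        "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Drain()

        js, err := nc.JetStream()
        if err != nil {
            log.Fatal(err)
        }

        // Ensure a stream captures the raw events subject.
        if _, err := js.AddStream(&nats.StreamConfig{
            Name:     "EVENTS",
            Subjects: []string{"events.raw"},
        }); err != nil {
            log.Fatal(err)
        }

        // Producers publish here directly, with no Kafka or bridge in between.
        if _, err := js.Publish("events.raw", []byte(`{"id":"evt-1"}`)); err != nil {
            log.Fatal(err)
        }

        // A durable consumer in the pipeline picks the events up.
        sub, err := js.SubscribeSync("events.raw", nats.Durable("dedup-stage"))
        if err != nil {
            log.Fatal(err)
        }
        msg, err := sub.NextMsg(2 * time.Second)
        if err != nil {
            log.Fatal(err)
        }
        log.Printf("got: %s", msg.Data)
        _ = msg.Ack()
    }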

