Hacker News | suhailpatel's comments

I'm in the fortunate position of having been able to tell our story in detail on our blog after a major outage involving Cassandra and bootstrap behaviour that we didn't fully understand. This is a story of how I brought down the bank for two hours.

https://monzo.com/blog/2019/09/08/why-monzo-wasnt-working-on...

In summary, we were scaling up our production Cassandra data store and we didn't migrate/backfill the data properly, which led to data being 'missing' for an hour.

In a typical Cassandra cluster, scaling up moves data around the ring a single node at a time. When you want to add multiple nodes, this can be an extremely time- and bandwidth-consuming process. There's a flag called auto_bootstrap which controls this behaviour. Our understanding was that a new node would not join the cluster until operators explicitly signalled it to do so (and this is a valid scenario because, as an operator, you can potentially backfill data from backups, for example). Unfortunately, we had completely misunderstood the flag when we originally changed the defaults many months prior to the scale-up.

Fortunately, we were able to detect data inconsistency within minutes of the original scale-up and we were able to fully revert the ring to its original state within 2 hours (it took that long because we did not want to lose any new writes, so we had to carefully remove nodes in the reverse order that they came in and joined the ring).
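
To make the ordering concrete, here is a minimal Python sketch of removing nodes in the reverse of their join order, one at a time. It is an illustration under assumptions, not the procedure we actually ran: `roll_back_scale_up` and `remove_node` are hypothetical names, and `remove_node` stands in for whatever blocking removal step an operator uses (for example, something that wraps `nodetool decommission` on that node).

```python
from typing import Callable, List, Sequence


def roll_back_scale_up(
    join_order: Sequence[str],
    remove_node: Callable[[str], None],
) -> List[str]:
    """Remove newly added nodes, last joined first, one at a time.

    `join_order` lists the new nodes in the order they joined the ring.
    `remove_node` is a hypothetical callable that must block until the
    node has fully left the ring.
    """
    removed: List[str] = []
    for node in reversed(join_order):
        # Wait for this removal to finish before touching the next node.
        remove_node(node)
        removed.append(node)
    return removed
```

Calling it with nodes that joined as `["n7", "n8", "n9"]` removes `n9` first and `n7` last.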

Through a mammoth effort by the engineering team over two days, we were able to reconcile the vast majority of inconsistent data through the use of audit events.

This was a mega stressful day for everyone involved. On the plus side though, I've had a few emails telling me that the blog post has saved others from making a similar mistake.


I have a Cassandra story as well. At a previous employer our org used a database that was a custom wrapper around Cassandra. This was a fairly large organization and this particular database was the keystone to the vast majority of operations of this particular organization. Well, one day I was giving a demo to some junior devs on how to use the REST API for the database which just so happened to take in raw Solr queries. I always liked to point that out to the newer devs as a way they could do some nice things that were otherwise fairly limited by the REST API.

Well, one of the junior devs just so happened to be playing around with various different Solr queries to see what he could get back and somehow issued a query that caused the entire staging database to fall over. That was a fun phone call to get. It wasn’t the junior dev’s fault, of course, but it really did wonders to expose the fragility of poorly optimized/unindexed queries against the database.

My experience in general with Cassandra is that outside of a few experts working with it, it was pretty poorly understood throughout the org and no one except those select people could really do anything when it all fell over.


After having spent many years working with it and interacting with it deeply, I would strongly recommend folks stay far away from Cassandra if you remotely care about your data. It provides way too many footguns to lose or corrupt or outright ruin your data.

Unless you work at Apple or Netflix or Spotify, finding Cassandra experts is going to be nigh on impossible and the community just isn't there unfortunately.


That’s a great writeup, thanks for all the detail!

I was always worried about something like this happening, so I only ever provisioned (via Ansible) one server at a time. When the logs showed it was fully synced, we provisioned the next node. It could take two days to add 10 nodes, but I always felt much safer.
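
The one-at-a-time approach above could be sketched roughly like this in Python. The names are hypothetical: `provision` stands in for the Ansible run and `is_fully_synced` for whatever check of the logs or ring state tells you the node has finished syncing. Neither is a real Cassandra or Ansible API.

```python
import time
from typing import Callable, Iterable, List


def add_nodes_one_at_a_time(
    nodes: Iterable[str],
    provision: Callable[[str], None],
    is_fully_synced: Callable[[str], bool],
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Provision each node and wait until it is synced before the next one."""
    joined: List[str] = []
    for node in nodes:
        provision(node)
        # Do not start the next node until this one reports as synced.
        while not is_fully_synced(node):
            sleep(poll_seconds)
        joined.append(node)
    return joined
```

The `sleep` parameter is only there so the loop can be exercised without real waiting. In practice you would probably also want a timeout so that a stuck node stops the rollout instead of blocking forever.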


On the cloud, it is likely simpler and faster to just spin up a new Cassandra datacenter, and then do a rebuild from the old datacenter to the new datacenter, either all nodes at once in parallel or in smaller batches. This procedure works fine regardless of whether you use static token allocation or vnodes, and adds very little load to the old datacenter, which is still serving traffic.
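
As a rough sketch of the batching part of this, the Python below splits the new datacenter's nodes into batches and builds the `nodetool rebuild` argument list that would be run on each new node. The helper names and the datacenter name are made up for illustration, and the sketch assumes the new datacenter has already been added to the keyspace replication settings; it does not run anything itself.

```python
from typing import List, Sequence


def rebuild_command(source_dc: str) -> List[str]:
    """Argument list for `nodetool rebuild`, streaming from `source_dc`."""
    return ["nodetool", "rebuild", "--", source_dc]


def rebuild_batches(new_nodes: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split the new datacenter's nodes into batches to rebuild together."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(new_nodes[i:i + batch_size])
        for i in range(0, len(new_nodes), batch_size)
    ]
```

Setting `batch_size` to the number of new nodes gives the "all nodes at once" variant; a smaller value gives the batched one.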


This is the standard approach and the one we have detailed runbooks for. We've scaled the cluster fine, one node at a time, since this experience. It also prompted us to get a much better understanding of all the other flags that have been changed from their defaults.


Two colleagues did a survey of on-call compensation: https://oncall.wtf/articles/2019-02/on-call-survey-2019.

It’s quite skewed towards the UK due to the social circles and reach, but still useful. I’m sure they would appreciate more data points!


Citymapper | Backend/iOS/Android Engineers | London, UK | ONSITE, VISA | https://citymapper.com/jobs

Cities are complicated. We use the power of mobile and open transport data to help humans survive and master them. We are building the best public transit app, one that caters for the needs of commuters. We are building a routing engine which is truly multimodal. We're running our own services to fill gaps in the transit network. To power all of this, we're leveraging open data as well as building the tools necessary for agencies to add and fix data.

We recently launched our new Smart Ride service, aimed at encouraging better shared mobility in cities. Read about The Responsive Network: https://medium.com/citymapper/the-responsive-network-part-3-.... This is super interesting and rewarding work from a technical perspective: we're constantly iterating and improving our planning, routing and simulation algorithms for Smart Ride to better serve our network. If you are interested in this sort of problem space, now is a fantastic time to get involved from the ground up.

See all our open positions at https://citymapper.com/jobs

We're hiring for Backend (Python, Go, C/C++, AWS), Frontend (Web, React, ES6) and iOS/Android engineers as well as Data Science.

Read our other blog posts at https://medium.com/@Citymapper. We've also launched an engineering blog: http://engineering.citymapper.com. If stuff like that interests you, definitely apply!

If you have any questions, feel free to drop me an email at suhail -at- citymapper -dot- com


Citymapper | Backend/iOS/Android Engineers | London, UK | ONSITE, VISA | https://citymapper.com/jobs

Cities are complicated. We use the power of mobile and open transport data to help humans survive and master them. We are building the best public transit app, one that caters for the needs of commuters. We are building a routing engine which is truly multimodal. We're running our own services to fill gaps in the transit network. To power all of this, we're leveraging open data as well as building the tools necessary for agencies to add and fix data.

We recently launched our new Smart Ride service, aimed at encouraging better shared mobility in cities. Read about The Responsive Network: https://medium.com/citymapper/the-responsive-network-part-3-.... This is super interesting and rewarding work from a technical perspective: we're constantly iterating and improving our planning, routing and simulation algorithms for Smart Ride to better serve our network. If you are interested in this sort of problem space, now is a fantastic time to get involved from the ground up.

See all our open positions at https://citymapper.com/jobs

We're hiring for Backend (Python, Go, C/C++, AWS), Frontend (Web, React, ES6) and iOS/Android engineers as well as Data Science.

Read our other blog posts at https://medium.com/@Citymapper

If you have any questions, feel free to drop me an email at suhail -at- citymapper -dot- com


Citymapper | Backend/iOS/Android Engineers | London, UK | ONSITE, VISA | https://citymapper.com/jobs

Cities are complicated. We use the power of mobile and open transport data to help humans survive and master them. We are building the best public transit app, one that caters for the needs of commuters. We are building a routing engine which is truly multimodal. We're running our own buses to fill gaps in the transit network. To power all of this, we're leveraging open data as well as building the tools necessary for agencies to add and fix data.

We recently launched our new Smart Ride service, aimed at encouraging better shared mobility in cities. Read about The Responsive Network: https://medium.com/citymapper/the-responsive-network-part-3-...

See all our open positions at https://citymapper.com/jobs

We're hiring for Backend (Python, Go, C/C++, AWS), Frontend (Web, React, ES6) and iOS/Android engineers as well as Data Science.

Read our other blog posts at https://medium.com/@Citymapper

If you have any questions, feel free to drop me an email at suhail -at- citymapper -dot- com


Citymapper | Backend/iOS/Android Engineers | London, UK | ONSITE, VISA | https://citymapper.com/jobs

Cities are complicated. We use the power of mobile and open transport data to help humans survive and master them. We are building the best public transit app, one that caters for the needs of commuters. We are building a routing engine which is truly multimodal. We're running our own buses to fill gaps in the transit network. To power all of this, we're leveraging open data as well as building the tools necessary for agencies to add and fix data.

Read our blog at https://medium.com/@Citymapper

See all our open positions at https://citymapper.com/jobs

We're hiring for Backend (Python, Go, C/C++, AWS), Frontend (Web, React, ES6) and iOS/Android engineers as well as Data Science.

If you have any questions, feel free to drop me an email at suhail -at- citymapper -dot- com


Citymapper | Backend & Mobile Engineers | London, UK | ONSITE, VISA | https://citymapper.com/jobs

Cities are complicated. We use the power of mobile and open transport data to help humans survive and master them. We are building the best public transit app, one that caters for the needs of commuters. We are building a routing engine which is truly multimodal. We're running our own buses to fill gaps in the transit network. To power all of this, we're leveraging open data as well as building the tools necessary for agencies to add and fix data.

Read our blog at https://medium.com/@Citymapper

See all our open positions at https://citymapper.com/jobs

We're hiring for Backend (Python, Go, C/C++, AWS), Frontend (Web, React, ES6) and iOS/Android engineers as well as Data Science.

If you have any questions, feel free to drop me an email at suhail -at- citymapper -dot- com


Citymapper | Software Engineers | London | ONSITE, VISA

https://citymapper.com/jobs

Cities are complicated. We use the power of mobile and open transport data to help humans survive and master them. We are building the best public transit app, one that caters for the needs of commuters. We are building a routing engine which is truly multimodal. To power all of this, we're leveraging open data as well as building the tools necessary for agencies to add and fix data.

Read our blog at https://medium.com/@Citymapper

See all our open positions at https://citymapper.com/jobs

We're hiring for Backend (Python, Go, C/C++, AWS), Frontend (Web, React, ES6) and iOS/Android engineers as well as Data Science.

If you have any questions, feel free to drop me an email at suhail -at- citymapper -dot- com


Do you have any temp/intern positions for the backend in London?


I would also be interested in this.


Citymapper | Software Engineers | London | ONSITE, VISA https://citymapper.com/jobs

Cities are complicated. We use the power of mobile and open transport data to help humans survive and master them. We are building the best public transit app, one that caters for the needs of commuters. We are building a routing engine which is truly multimodal. To power all of this, we're leveraging open data as well as building the tools necessary for agencies to add and fix data.

Read our blog at https://medium.com/@Citymapper

See all our open positions at https://citymapper.com/jobs

We're hiring for Backend (Python, Go, C/C++, AWS), Frontend (Web, React, ES6) and iOS/Android engineers as well as Data Science.

If you have any questions, feel free to drop me an email at suhail -at- citymapper -dot- com


Citymapper — London, UK | Full Time | Onsite (we offer visa support) | https://citymapper.com

Join us in our mission to make cities usable by building the ultimate transport app.

Hiring for ALL roles (Engineering, Design, Product), including:

-- Web Developer (React, Redux)

We build a lot with modern JS technologies. We have our web app, but also many systems behind the scenes that allow us to be the best source of transit data in our cities. We use React + Redux, CSS modules and PostCSS, Webpack, Django.

-- Android Developers & iOS Developers

We're particularly interested in developers who are passionate about UI, and/or using sensors & location efficiently.

-- Site Reliability Engineers

Help Citymapper scale its platform by orders of magnitude. We are currently in 30+ cities, but we are going to be expanding to reach everyone who needs us.

-- Data Science

We're looking for data scientists to work on a variety of projects including improving the experience of the apps to make them more personal.

Read about our $40M Series B: https://medium.com/@Citymapper/getting-from-a-to-series-b-88...

Apply at https://citymapper.com/jobs/

Contact me at suhail at citymapper dot com if you have any questions.


Application sent.

