For my current job, we wanted to get a mapping of stock tickers and exchanges to...

mlthoughts2018 · on Dec 4, 2018

This is why finance is lucrative, similar to esoteric codes in various types of law. Nothing to do with math models or superior prediction, just paying for someone else to fight through identifier hell, exchange protocol hell, etc., and be able to do some mickey mouse math at the end of it.

Honestly, this stuff is so bad that the headache of it might fully justify huge finance compensation, and I’ve had colleagues who turned down huge bonuses and raises to leave finance companies solely to avoid this type of stuff and seek a career where the headaches bother them less and they are paid less.

hendzen · on Dec 4, 2018

Data cleaning/transformation ends up being a huge percentage of the work in pretty much any real-world ML context I'm familiar with. Not unique to finance at all.

mlthoughts2018 · on Dec 4, 2018

I’ve worked for over a decade in industry machine learning, about half of that in quant finance. It is definitely much worse in finance than other fields.

Even medical records do not present the same degree of esoteric data formatting and mismatching. It’s not really even a matter of data cleaning. It’s that there is _no_ way to clean the data, and the only useful approach is to pay tens of thousands of dollars to data vendors whose products have intractable errors, and then build huge data validation and imputation systems around it.

When it boils down to fiduciary duty to the client, and you have a contractual obligation regarding portfolio composition, then you can’t live with “good enough” data cleaning. Even one single asset with an incorrect identifier from your data vendor can cause you to e.g. invest in an Israeli company in a portfolio with a client obligation to invest in no Israeli companies (that is a real example I encountered before).
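To make the failure mode concrete, here is a minimal sketch of the kind of validation layer described above: cross-check country of domicile between two vendor feeds before applying a client exclusion list. Everything here is a hypothetical illustration (the function name `screen_holdings`, the dict-shaped feeds, the placeholder identifiers), not any vendor's actual API.

```python
def screen_holdings(holdings, vendor_a, vendor_b, excluded_countries):
    """Split holdings into hard violations and items needing manual review.

    holdings:            list of identifiers (placeholders here)
    vendor_a, vendor_b:  dicts mapping identifier -> country code (hypothetical feeds)
    excluded_countries:  set of country codes the client mandate forbids
    """
    violations, needs_review = [], []
    for ident in holdings:
        a = vendor_a.get(ident)
        b = vendor_b.get(ident)
        if a is None or b is None or a != b:
            # Missing or conflicting reference data: never trust a single feed.
            needs_review.append(ident)
        elif a in excluded_countries:
            # Both feeds agree the asset is in an excluded country.
            violations.append(ident)
    return violations, needs_review


# Made-up data: the two feeds disagree about "CCC".
holdings = ["AAA", "BBB", "CCC"]
vendor_a = {"AAA": "US", "BBB": "IL", "CCC": "US"}
vendor_b = {"AAA": "US", "BBB": "IL", "CCC": "IL"}

violations, needs_review = screen_holdings(holdings, vendor_a, vendor_b, {"IL"})
```

With only `vendor_a`, "CCC" would pass the screen silently. The second feed does not tell you which vendor is right, only that a human has to look.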

anongraddebt · on Dec 4, 2018

I come from the non-technical side of things. Do you know of any resources that would cover this issue, but for someone on the business side?

Not an engineer, so while I understand this in a general/abstract sense, my understanding is limited to, "Cleaning/transformation is messy and a time sink due to non-standardization of data."

pmart123 · on Dec 4, 2018

One good example I uncovered a while back was that Bloomberg timestamped its crude oil futures data by finding the last trade to occur in a given second and rounding down. This means that the user of the data had no idea whether the price on the 10:30:30 AM print occurred at 10:30:30.999 or 10:30:30.001. Obviously, this could create problems if you thought you had found a lead/lag relationship between, say, oil and oil stocks.
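A rough sketch of why truncated timestamps undermine lead/lag claims: if a timestamp was rounded down to some resolution, the true event time could be anywhere in the following interval, so the true lag between two series is only known to within a band. The function names and the default one-second resolutions below are assumptions for illustration.

```python
def lag_bounds(ts_a, ts_b, resolution_a=1.0, resolution_b=1.0):
    """Bounds on the true lag (event B minus event A), in seconds.

    Assumes each reported timestamp was rounded DOWN, so the true time of A
    lies in [ts_a, ts_a + resolution_a) and likewise for B.
    """
    observed = ts_b - ts_a
    return observed - resolution_a, observed + resolution_b


def lag_sign_is_unambiguous(ts_a, ts_b, resolution_a=1.0, resolution_b=1.0):
    """True only if the bounds rule out one ordering of the two events."""
    lo, hi = lag_bounds(ts_a, ts_b, resolution_a, resolution_b)
    return lo >= 0 or hi <= 0


# Two prints stamped in the same second: either one could have come first.
same_second = lag_sign_is_unambiguous(100, 100)

# Stamped one second apart: A must precede B, but the true lag could be
# anything from just above 0 up to just under 2 seconds.
one_apart = lag_bounds(100, 101)
```

Any apparent lead/lag shorter than the timestamp resolution falls inside that band and cannot be distinguished from an artifact of the rounding.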

Similarly, say a vendor aggregated website visits/pageviews but didn't account for the fact that 1/3 of the traffic was coming from click-bots in developing countries. If they presented you with the raw data you could figure it out and filter those countries out, but if it is aggregated, you might not discover the issue.

Then, there could be even simpler ones like determining the opening price for a stock. If say the first print of stock XYZ trades 10 shares at a price of $20, but a millisecond later, 100k shares trade at $20.11, which print should you use in your simulation algorithms as the opening print?
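To show how much the choice matters, here is a small sketch with three candidate definitions of the "opening price" applied to the XYZ example. The `Print` type, the 100-share minimum, and the one-second window are all arbitrary assumptions for illustration, not an exchange or vendor convention. The prints are assumed to be non-empty and sorted by time.

```python
from typing import NamedTuple, Optional, Sequence


class Print(NamedTuple):
    time_ms: int   # milliseconds since the session open (hypothetical unit)
    price: float
    size: int      # shares traded


def first_print(prints: Sequence[Print]) -> float:
    """Rule 1: the very first trade, however small."""
    return prints[0].price


def first_print_min_size(prints: Sequence[Print], min_size: int = 100) -> Optional[float]:
    """Rule 2: the first trade of at least `min_size` shares, or None if there is none."""
    for p in prints:
        if p.size >= min_size:
            return p.price
    return None


def opening_vwap(prints: Sequence[Print], window_ms: int = 1000) -> float:
    """Rule 3: volume-weighted average price over the first `window_ms`."""
    start = prints[0].time_ms
    window = [p for p in prints if p.time_ms - start <= window_ms]
    total_size = sum(p.size for p in window)
    return sum(p.price * p.size for p in window) / total_size


# The XYZ example: 10 shares at $20.00, then 100k shares at $20.11 a millisecond later.
xyz = [Print(0, 20.00, 10), Print(1, 20.11, 100_000)]

rule1 = first_print(xyz)            # 20.00
rule2 = first_print_min_size(xyz)   # 20.11
rule3 = opening_vwap(xyz)           # just under 20.11, dominated by the large print
```

None of the three is "the" right answer. The point is that a backtest silently inherits whichever rule the data vendor or your own loader happened to pick.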

pmart123 · on Dec 4, 2018

Did you look at Factset's datafeed? I've found its reference data and symbology to be pretty reliable. Cusips will cost a lot with redistribution charges though. You're better off avoiding them if possible.

erichurkman · on Dec 4, 2018

Yeah, we did look at Factset. Ultimately we found repeated gaps in their symbology, since we needed a full set, including less commonly used symbols.

JaggerFoo · on Dec 4, 2018

I agree, CUSIP is also a problem for the privateer (meaning all data needs to be free to use). While I have found a way to find a mapping online, I have no idea of the accuracy and have to trust that the unaware provider QA's the data.