Top 5 Trends in Alternative Data

We are in the fortunate position here at Quandl to bear firsthand witness to the evolution of alternative data; interactions with clients and data providers, together with immersion in the space, give us a unique perspective. From our standpoint, alternative data is growing unabated, although innovation is perhaps slower than some would like. Here we catalog current trends.

1. Welcome to the trough of disillusionment

If you’re not familiar with the hype cycle, it describes the maturity, adoption, and social application of specific technologies. It exists because history constantly repeats itself, and alt data is no exception. We’ve all seen the headlines: “there’s no gold to be found in alt data,” or “we looked at 700 datasets and didn’t find anything.” This is unsurprising, and it is a necessary step on the way to the slope of enlightenment. Alternative data is being used successfully by many early adopters, but it is not experiencing the stratospheric growth many expected by now. Put another way: the early adopters will continue to reap the rewards of alt data while the majority plays catch-up.

Sophisticated quantitative and quantamental funds have figured out that the low-hanging fruit has been picked, and they are moving on to the second wave of strategies. According to these funds, they are discovering use cases for which alternative data is better suited than traditional data.

Those who think creatively about how best to use this data will be the winners – but of course that’s dogma when it comes to the hype cycle.

Figure: Hype Cycle (adapted)

2. Alt data is made, not found

The days of finding a dataset in the wild that instantly translates to trading profits are over, if they ever really began. Non-market data is rarely, if ever, ready to use. Investors and data providers now realize that maximizing value from alternative data requires substantial processing, and this requires pulling together multiple sources of non-market, reference, and other data. For example:

      • Entity resolution from products to brands to companies to tickers must be layered into any dataset for equity-oriented clients.
      • Third-party taxonomy data must be used to classify products, transactions, or other entries.
      • Layering financial or risk data over an alternative dataset to create a composite is becoming more popular.
      • Machine learning is often called upon to classify raw data into something usable.
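
The first item in this list, entity resolution, can be pictured as a chain of lookups. The sketch below is only an illustration under stated assumptions: the three reference tables and the function name `resolve_ticker` are hypothetical, and a production pipeline would draw these mappings from maintained reference data rather than hard-coded dictionaries.

```python
from typing import Optional

# Hypothetical reference tables, hard-coded purely for illustration.
PRODUCT_TO_BRAND = {"iphone 11": "iPhone", "air max 90": "Air Max"}
BRAND_TO_COMPANY = {"iPhone": "Apple Inc.", "Air Max": "Nike, Inc."}
COMPANY_TO_TICKER = {"Apple Inc.": "AAPL", "Nike, Inc.": "NKE"}


def resolve_ticker(product: str) -> Optional[str]:
    """Walk product -> brand -> company -> ticker.

    Returns None as soon as any link in the chain is missing.
    """
    key = product.strip().lower()          # light normalization of the raw product name
    brand = PRODUCT_TO_BRAND.get(key)      # product -> brand
    company = BRAND_TO_COMPANY.get(brand)  # brand -> company (a None brand yields None)
    return COMPANY_TO_TICKER.get(company)  # company -> ticker


print(resolve_ticker("  iPhone 11 "))
print(resolve_ticker("unknown gadget"))
```

Returning None for an unresolved product, rather than guessing, keeps unmapped rows visible so they can be reviewed or dropped before any signal is built on them.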

There is no brochure-ware here. Like anything else, smart products that meet investors’ needs must be built. This requires not only blending multiple datasets together – an arduous task in and of itself – but also building the technology to turn this into a repeatable, sustainable process.

3. Man: 1, Machine: 0.5

“The future is here; it’s just not evenly distributed” – William Gibson

We are seeing increasing maturity around where to apply machine learning in the industry, and where not to. Initially, a number of funds and banks applied ML techniques to the “glamorous” side of investing: trade execution and strategy design. This turned out to be less productive than expected. While ideas and innovation abound, the returns have been slow to accrue. Michael Steliaros from Goldman Sachs reflects the view of much of the industry when he says the major impact of ML has been on Goldman’s “advertising material”.

But perhaps that is looking at the wrong stage of the pipeline. Where ML really shines is in handling large, messy, noisy, gap-filled, ambiguous, unstructured or semi-structured data. And the data explosion of recent years has created enormous amounts of precisely that. We’ve seen funds effectively use or incorporate ML into their workflows, not at the portfolio design stage, but at the data preparation stage.

For example: many alternative datasets make reference to “real-world” objects: store locations, or brands, or websites, or container ships. For this data to be usable, these objects have to be mapped (“resolved”) to the economically-relevant entities (companies), and then to tradable securities (tickers). The data is often ambiguous, incomplete and noisy: in other words, perfect raw material for ML.

More specific use cases also exist. Computer vision can be used to convert raw satellite images into car or ship counts. Fuzzy mappings can disambiguate transaction data. NLP can find subconscious “tells” in the language of earnings calls.
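
To make the fuzzy-mapping idea concrete, here is a minimal sketch that matches messy merchant strings to a small list of canonical names. It uses plain string similarity from Python's standard `difflib` module as a stand-in for a trained model. The merchant strings, the `CANONICAL` table, the function name, and the 0.8 cutoff are all assumptions chosen for illustration.

```python
import difflib
import re
from typing import Optional

# Hypothetical canonical merchant names mapped to tickers.
CANONICAL = {"apple store": "AAPL", "starbucks": "SBUX"}


def match_merchant(raw: str, cutoff: float = 0.8) -> Optional[str]:
    """Map a raw transaction descriptor to a ticker, or None if nothing is close enough."""
    # Normalize: lowercase, drop digits and punctuation, collapse whitespace.
    cleaned = re.sub(r"[^a-z ]+", "", raw.lower())
    cleaned = " ".join(cleaned.split())
    # Keep only the single best candidate whose similarity ratio is at least the cutoff.
    hits = difflib.get_close_matches(cleaned, list(CANONICAL), n=1, cutoff=cutoff)
    return CANONICAL[hits[0]] if hits else None


print(match_merchant("APPLE STORE #R123"))
print(match_merchant("Joe's Corner Deli"))
```

The cutoff trades coverage for precision: a lower value resolves more rows but risks mapping an unrelated merchant to a ticker, which is why low-confidence matches are usually better left unresolved.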

In none of these cases is the ML recommending a specific investment decision; that part of the process is still left to the human analyst or quant researcher or PM, later in the pipeline. But ML helps these end users grapple with datasets that are too big and too messy to work with directly.

4. The end and the beginning of anonymity

Misuse of consumer data — by anyone — hurts everyone. As more privacy scandals and data breaches come to light, more concern over access to consumer data hits the media, and more lawyers get worried.

While the adage that “investors don’t care about an individual’s data” is true, some firms shy away from certain types of data altogether because it cannot be anonymized. This is perhaps more due to headline risk than anything else.

The irony is that there is still little to no case law or published guidance on the subject, and no established legal or regulatory framework.

While investors don’t need anonymization per se, because the data is only useful in the aggregate, they must ensure absolute anonymity through their agreements with vendors and in the data itself.

So we are seeing large-scale investment from both providers and practitioners in privacy technologies such as differential privacy, synthetic data generation, and data blending. Although there are challenges to these approaches, it appears that the ability to use data safely is becoming an industry unto itself.
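
As one illustration of these technologies, differential privacy is commonly implemented by adding calibrated random noise to an aggregate before releasing it. The sketch below applies the Laplace mechanism to a count. The function name, the parameters, and the example figures are hypothetical, and this is a teaching sketch rather than a hardened implementation.

```python
import random


def private_count(true_count: int, epsilon: float, rng: random.Random,
                  sensitivity: float = 1.0) -> float:
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    A smaller epsilon means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    # The difference of two independent exponentials with mean `scale`
    # is a Laplace random variable centered at zero with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


rng = random.Random(42)  # seeded only so the illustration is repeatable
print(private_count(1000, epsilon=0.5, rng=rng))  # hypothetical count of 1000 shoppers
```

Because the noise averages out to zero, aggregates over many records stay useful, while the presence or absence of any single individual has only a bounded effect on the published number.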

5. The pricing solar system

The best alternative data is that which is closest to the sun. For example, imagine you are trying to capture sales of iPhones. The best possible source would be first-party data from Apple (which you are not likely to get). The signal from that data would give you a lot of insight into Apple’s stock price.

The second best would be e-receipt data because you’re seeing SKU-level transactions. Third might be credit card data, which might not give you product information, but will give you company data. Fourth could be footfall data to Apple stores. Fifth: cars parked outside malls that have Apple stores. And perhaps people posting #lovemyiphone on Twitter counts for something too.

Figure: The solar pricing system

This concept of data seniority is directly related to the pricing of alternative datasets. As expected, data closer to the sun is more valuable. And that is the current trend driving the pricing of alternative datasets.

Figure: Price by Dataset Type. Source: Neudata

This is also one of the reasons transaction-level data is so valuable and, in our experience, the most frequently requested type of data. As this quickly diffusing alternative data source matures, customers are asking more specific questions about its features, indicators, and other aspects.

There is no doubt the ecosystem will continue to mature. How fast is anyone’s guess. The ROI calculation is simple: if investors can drive profits from alternative data they will adopt it. Today, some of it is useful; some of it is no better than the market data that has been in use for years. These mixed results can be confusing for some. For others, they are a clear opportunity.