The Death of the Hypothesis, or, Investing in Big Data Analytics and Deep Learning

Over the last 24 months, I have been watching a profound shift in how we learn about the world. Barely noticeable at first, these newborn techniques now boast a string of successes and promise to accelerate the pace of discovery dramatically.

The scientific method is the backbone of every scientific breakthrough humanity has achieved in the last 350 years. We ask a question, make preliminary observations, formulate a hypothesis, and then run experiments to confirm or reject it. If the experiments confirm the hypothesis, we know we are getting closer to the truth behind what we observed. If they reject it, we formulate another hypothesis and begin another round of experiments. We keep iterating, cycling between hypothesis formulation and experimentation. The scientific method has served humankind well, and there is little doubt that it will remain relevant and widely used for many years to come.

There is a new kid on the block. In the last five years, the paradigm of discovery has started to change. In many fields, research and business decisions no longer involve any hypothesis formulation at all. What has changed is the advent of big data and deep learning algorithms that discover patterns in massive amounts of information. Genetic data analysis, image classification, network intrusion and banking fraud detection, and stock trading strategies are increasingly based on computer modeling results that we do not really understand but can nevertheless use. This has a number of implications for us as investors in companies built on disruptive science and technology.

Big data analytics starts by collecting lots and lots of data. On the consumer level, our modern digital lives generate an insane amount of it. Your mobile phone, for example, is constantly connected to the cellular network, producing a huge trail of information about where you were, when you were there, and what you were doing. Multiply that by the seven billion other cellular devices in the world, and you get a sense of what I mean by BIG data. Add layers of photos and videos taken, goods and services ordered, medical problems discovered and treated, and the genetic make-ups of the individuals involved … and you get a lot of information to sort through.

The beauty of this raw, almost real-time data is that it is immensely rich in variables and terrible in its signal-to-noise ratio. Many patterns are just too faint for any human to notice. Others manifest as multi-variable objects that our brains cannot visualize or recognize. When you run a computer analysis on the data, the computers find patterns that we humans could not extract by ourselves. But the computers cannot explain why those patterns exist; they are not capable of finding causality. In a rapidly growing number of cases, humans cannot even remotely explain the discovered relationships either, even though the computer code that identifies those patterns is relatively simple. Much as the simple laws of evolution produced an incredibly complex and rich Earth ecosystem, simple computer code running on top of big data produces patterns of incredible [and possibly incomprehensible] complexity. There is no hypothesis being formed.
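
To make this concrete, here is a minimal, purely hypothetical sketch of what pattern finding without a hypothesis looks like in code: an off-the-shelf clustering algorithm is pointed at synthetic, high-dimensional records and left to find structure on its own. The data, the fifty variables, and the two-cluster split are all illustrative assumptions, not anything drawn from a real dataset.

    import numpy as np
    from sklearn.cluster import KMeans

    # Synthetic stand-in for "big data": 10,000 records, 50 variables each.
    # No hypothesis is stated about what the variables mean or how they relate.
    rng = np.random.default_rng(0)
    X = np.vstack([
        rng.normal(loc=0.0, scale=1.0, size=(4000, 50)),
        rng.normal(loc=0.8, scale=1.0, size=(6000, 50)),  # a faint shift spread over 50 variables
    ])

    # The algorithm looks for structure on its own; it offers no explanation of why the structure exists.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    for cluster in np.unique(labels):
        print(f"cluster {cluster}: {np.sum(labels == cluster)} records")

The point of the sketch is that nothing in the code encodes a theory about the data; whatever grouping emerges simply falls out of the numbers.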

To bring this back to investing, I am working with UberSeq and EchoPixel, two companies that are using big data for life science applications. I see a big opportunity for deep learning techniques applied to biological data. Our ability to generate data from biological systems through better analytic tools and diagnostic equipment is substantially outpacing our ability to analyze it and make sense of it.

For example, the cost of genetic sequencing is dropping even faster than the exponential pace of the well-known Moore’s Law from the world of computers. In its popular form, Moore’s Law holds that every 18 months the power of computer chips doubles while the price drops by half, an exponential function. The price of genetic sequencing is falling notably faster than that.
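
To put rough numbers on “faster than Moore’s Law,” here is a back-of-the-envelope sketch comparing two halving curves. The 18-month cadence is the popular Moore’s Law figure mentioned above; the 12-month halving time for sequencing is a purely hypothetical parameter chosen for illustration, not a measured one.

    # Exponential cost decline: cost(t) = 100% * 0.5 ** (t / halving_time)
    def relative_cost(halving_months, months_elapsed):
        return 100.0 * 0.5 ** (months_elapsed / halving_months)

    for years in (1, 3, 5):
        months = years * 12
        moore = relative_cost(18, months)        # Moore's Law cadence: halves every 18 months
        sequencing = relative_cost(12, months)   # assumed: sequencing cost halves every 12 months
        print(f"after {years} yr: Moore's Law -> {moore:.1f}% of starting cost, "
              f"sequencing (assumed) -> {sequencing:.1f}%")

Even a modest gap in halving time compounds quickly: after five years the assumed sequencing curve sits far below the Moore’s Law curve.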

The pace of this price decline is staggering: I am not aware of any product or service in the history of humankind that has fallen in price so quickly. Somebody could write a Ph.D. thesis in economics on how to deal with a market where prices drop at that speed. It changes everything. The market at the beginning of the year is totally unlike the market at the end of the year.

It also means we are about to be drowning in biological and genetic data. The time is not far off when people will be genetically sequenced as a routine part of every blood test, to see which genes are over-expressed or under-expressed compared to the previous test. DNA sequencing is not the only source of this data; people are already wearing fitness monitors around the clock. We clearly need better ways of dealing with this data, and that need is leading to concerted efforts to bring dispersed islands of knowledge under one roof. We want to make it easier for computers to analyze this biological data, both in terms of accessing it and of running learning algorithms capable of finding the patterns.
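
As a hypothetical sketch of what “compared to the previous test” could look like computationally, the snippet below flags over- and under-expressed genes between two runs using a simple log2 fold change. The gene names, the counts, and the two-fold threshold are all invented for illustration.

    import math

    # Hypothetical expression levels from two routine tests (arbitrary units).
    previous = {"GENE_A": 120, "GENE_B": 340, "GENE_C": 55, "GENE_D": 900}
    current = {"GENE_A": 480, "GENE_B": 310, "GENE_C": 12, "GENE_D": 880}

    for gene in previous:
        # Positive log2 fold change means over-expressed versus the previous test; negative means under-expressed.
        fold_change = math.log2(current[gene] / previous[gene])
        if abs(fold_change) >= 1.0:  # arbitrary threshold: at least a two-fold change
            direction = "over-expressed" if fold_change > 0 else "under-expressed"
            print(f"{gene}: {direction} (log2 fold change = {fold_change:+.2f})")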

Another exciting and relevant development is our ability to run deep learning tools on top of previously disaggregated data. Historically, you might have had people with deep knowledge of genomics and others with deep knowledge of proteomics, and they probably worked in different buildings and never spoke with one another. As humans, we split our knowledge into vertical domains because there is only so much an expert can learn before the brain becomes overwhelmed. Machines don’t have those limits. Computers don’t care if you mix proteomics, genomics, metabolomics, and other omics together. The more data you can feed the machines, across more domains, the more patterns they are likely to identify.
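
As a hedged sketch of what “mixing the omics” might look like in practice, the snippet below simply concatenates genomic, proteomic, and metabolomic feature blocks into one wide matrix and hands the whole thing to a single model. The shapes, the random data, the phenotype label, and the choice of model are illustrative assumptions, not a description of any particular pipeline.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(1)
    n_samples = 500

    # Three previously siloed data sources, simulated here with random numbers.
    genomics = rng.normal(size=(n_samples, 200))      # e.g., variant or expression features
    proteomics = rng.normal(size=(n_samples, 80))     # e.g., protein abundance features
    metabolomics = rng.normal(size=(n_samples, 40))   # e.g., metabolite concentration features

    # The machine does not care that these came from different "buildings":
    # one wide feature matrix, one model, patterns sought across all domains at once.
    X = np.hstack([genomics, proteomics, metabolomics])
    y = rng.integers(0, 2, size=n_samples)            # hypothetical phenotype label

    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    print("features used across all omics:", X.shape[1])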

In the life sciences, I think we are rapidly moving into a time when we can run the machines and get empirical recipes that we know will work, even though we don’t really understand why they work. We are “back in the future,” acting like our Bronze Age ancestors, who learned that if you throw certain stones into a hot fire, you get metal. It took millennia before science understood why that happened. Eventually, science may catch up and develop hypotheses to explain what we learn through big data analytics, but for now, the empirical approach is leading the way.

How do these profound changes impact the opportunities for investors? That is a very good question, and I have a good working hypothesis to act upon. For now, however, I am reminded of an old proverb: “There are two rules of influence. First, you should never say everything you know.”