This post is a continuation of the previous installment of this mini-series. If you missed that, you should read part 1 for more context.
One of the most wonderful things about an in-person conference is the “Hallway Track,” the informal conversations happening off-schedule (e.g. during coffee breaks, dinners, etc.). Over the reception dinner of TechInnov Day – AI for Africa, I had the chance to consult Yann LeCun about the application of deep neural networks (DNN) to tabular data. This is an esoteric discussion that deserves some explanation.
Ever since I joined PROS in 2018, we have been actively incorporating DNN into our solutions (e.g. in our Gen IV Price Optimization). Our application of DNN is fundamentally different from the highly popularized applications in cognitive AI that aim to mimic the higher cognitive functions of humans (e.g. vision, speech, auditory processing, language, etc.).
We are applying DNN to tabular data. But wait… at the end of the day, aren’t all data just numbers and bits in some tables?
Why tabular data are challenging in machine learning
What’s so special about our “tabular data”? The difference lies in the intrinsic statistics (i.e. the correlation structures) of the data themselves. DNN applications in cognitive AI all work on data that have a high level of internal correlation across their dimensions. Since these data also have very high dimensionality (typically from thousands to millions of dimensions), we will refer to them as highly correlated high-dimension (HCHD) data (see examples below).
- Image and video data in computer vision have highly correlated spatial statistics. Seeing a few pixels in an image tells us a lot about what their neighboring pixels should be. These correlations are what allow our brains to recognize objects in the image we are viewing and even distinguish different styles of art. Likewise, seeing a single frame of a video allows us to predict the next several frames very well, because they are highly correlated (i.e. not random).
- Speech and music data in auditory processing have highly correlated temporal statistics. Hearing a short tune or sequence of phonemes enables us to infer what comes next with a high degree of accuracy, because speech and music are far from random sounds. These pre-established temporal correlations are what allow us to recognize different genres of music and different spoken languages just from how they sound.
- Text data in language models are not only highly correlated, but these correlation structures are also very robust within a language. This is what enables ChatGPT to generate sentences and paragraphs simply by predicting the next word, one word at a time.
Tabular data are inherently different from the HCHD data above because the correlation structures within them are basically unknown a priori. Moreover, the correlation structures of tabular data vary greatly from one dataset to the next, making it difficult to study and quantify their intrinsic statistics. Although tabular data generally have lower dimensionality than the HCHD data, the lack of pre-established internal correlations means there are fewer known behaviors or patterns that the machine can leverage. That is one reason why, despite the huge interest, there is still no machine learning (ML) algorithm that can reliably predict the stock market.
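To make the contrast above a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the “image-like” rows are just random walks (so neighboring positions are similar by construction), the “table-like” rows have columns drawn independently, and the helper names (`pearson`, `smooth_row`, `independent_row`, `adjacent_correlation`) are made up for this example. It is not real image or pricing data, and real tabular data usually sit somewhere between these two extremes.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def smooth_row(rng, width):
    """Synthetic 'image row' (assumption): a random walk, so neighbors are similar."""
    value, row = 0.0, []
    for _ in range(width):
        value += rng.gauss(0.0, 1.0)
        row.append(value)
    return row


def independent_row(rng, width):
    """Synthetic 'table row' (assumption): every column drawn independently."""
    return [rng.gauss(0.0, 1.0) for _ in range(width)]


def adjacent_correlation(rows):
    """Mean Pearson correlation between each column and its right-hand neighbor."""
    columns = list(zip(*rows))
    scores = [pearson(a, b) for a, b in zip(columns, columns[1:])]
    return sum(scores) / len(scores)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
image_like = [smooth_row(rng, 32) for _ in range(500)]
table_like = [independent_row(rng, 32) for _ in range(500)]

image_corr = adjacent_correlation(image_like)
table_corr = adjacent_correlation(table_like)
print(f"image-like adjacent correlation: {image_corr:.2f}")
print(f"table-like adjacent correlation: {table_corr:.2f}")
```

In this toy setup the image-like rows should show a strong correlation between neighboring positions, while the independent columns should hover around zero. The point is only that in HCHD data the structure is known ahead of time (neighbors matter), whereas in a table nothing tells us in advance which columns, if any, relate to each other.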
Why can problems that are easy for humans be so difficult for machines?
Despite their popularity, today’s cognitive AIs are solving relatively simple problems that we, humans, already know how to solve. It’s easy for a 5-year-old to visually distinguish an image of a dog from that of a cat (even though he has never seen that specific dog or cat before). It’s easy for a teenager to identify all his favorite music genres just from a few seconds of listening (even though he might not be able to tell you why). And it’s relatively effortless for either of them to compose a few coherent sentences that are grammatically correct. These tasks are easy for us, because our brains can take advantage of the intrinsic statistics within the data. However, they used to be impossible for machines, because machines lacked the computing power to leverage all those internal correlations. But that has changed today.
Analyzing tabular data is not easy for humans, so it’s often reserved for highly specialized domain experts. Whether it’s an astronomer, a pricing analyst, a nuclear physicist, or a stock trader, these experts may eventually develop some intuition about the data they examine day after day. That means their brains have eventually picked up the telltale correlations within the tabular data they have worked with for so long. If we then give these experts a new data set they have never seen before, they can often get a pretty good hunch about what might happen next (even though they might not be able to tell you why).
Although today’s cognitive AIs are only solving “simple problems” that we’ve learned to solve at a young age, they are still extremely valuable due to the speed, scale, and consistency they offer. Although we can reply to 100 emails (it’s not hard), it will take us a good part of our day, and we might not even do it with a consistent tone and style. With ChatGPT’s help, we can potentially do that in 10 minutes and with much greater consistency.
Now, imagine if DNN could learn the intrinsic correlations in any generic tabular data, as domain experts do through years of experience, and leverage them well.
Application of deep neural networks to tabular data is still a greenfield
Figuring out the intrinsic correlation structure in tabular data is no doubt a very challenging problem, let alone using these correlations effectively. So, I asked Yann whether there were any recent breakthroughs in this area. And I was a little surprised when Yann confirmed that there isn’t a lot of research applying DNN to tabular data.
This answer is both disturbing and encouraging. The lack of research suggests that previous efforts haven’t been fruitful, so no one has published much about them. There could be two reasons for this:
- There isn’t any consistent correlation structure in tabular data
- The intrinsic correlation structure of tabular data is very hard to find, and no one has discovered it yet
Personally, I want to believe it’s the latter, because human experts can sometimes develop some intuition after years of experience working with certain tabular data. This implies there must be some correlations, and the experts’ brains are able to leverage them when analyzing their domain-specific tabular data. So I take Yann’s answer positively, as it implies that the application of deep neural networks to tabular data is still a greenfield.
Next time, we’ll get into the real meat of the conference. So if you are wondering what I took away from Yann LeCun’s talk and from companies like DeepMind, stay tuned for my next post.