Data-Centric AI and How to Adopt This Approach

Interview with Eikku Koponen and Jean-Emmanuel Wattier

If you follow the big names of the industry, you have probably noticed Andrew Ng’s Data-Centric AI competition, which trended this year. We at Valohai and Ingedata are glad that there is finally a proper focus on data, its validity, and its reliability, after a decade of hype, first around big data and then around machine learning models and AI systems. Everybody knows this: data, and especially its quality, is what matters. Most datasets aren’t that big, and good old logistic regression will do the magic most of the time, yielding explainable results.

What is data-centric AI?

“Data is food for AI” is a quote from Andrew Ng used in many posts and materials this year. He means that a model can only do what its training data teaches it: garbage in, garbage out, if you will. This is tightly related to the discussion on ethics: whether your model is biased depends on your training data, and on whether that bias is there on purpose. It is also tied to the fact that your data is, if not the most valuable asset you have when creating AI systems, at least close to it.

“What we’re missing is a more systematic engineering discipline of treating good data that feeds A.I. systems,” Ng said. “I think this is the key to democratizing access to A.I.” – Fortune, November 8th, 2021

Beyond the relevance of the data itself, data scientists spend most of their time on data preparation tasks, according to multiple surveys (Forbes and Datanami) and my own experience. The focus of research and discussion around AI and machine learning should be proportionate to that. But it is not: only 1 percent of AI research focuses on data.