The Center for Security and Emerging Technology (CSET) at Georgetown University just published a report, titled “Small data’s big AI potential”. “Small data” means “the ability of machine learning or AI systems to learn from small training data sets”.
“An overemphasis on data ignores the existence–and underestimates the potential–of several AI approaches that do not require massive labeled datasets.” (From “Small Data’s Big AI Potential”, CSET Issue Brief, by Husanjot Chahal, Helen Toner, and Ilya Rahkovsky)
And back in 2015, Peter Sweeney argued that “big data is plagued with small data problems”. While we do have an ever-increasing amount of data from sensors, user behavior, machines, human-created content and others, “with respect to any specific individual or interest, the data is often sparse”.
By the way, by “labeled data”, I mean data that come with the correct answer attached, so that a machine can learn from them. For example, if you want to train a machine to recognize dogs, you’d need pictures (or sounds or smells) of dogs, labeled “dog”, and pictures (or sounds or smells) of non-dogs, labeled “not a dog”. Then the machine can use these labeled data to learn to “recognize” feature patterns that distinguish dogs from non-dogs (note that the machine has to be built such that it can “recognize” these features).
If you can’t work with small data sets, you lose out on attractive market opportunities
Here’s the rub: While we do have tons of data, these data very often consist of many small data sets. And there are many attractive market opportunities that are rife with these small data sets.
For example, consider content personalization: Yes, people generate lots of data all the time. This includes where they click, or what they read, for instance. But we typically have very little individual-level data. Just think about the number of news articles you read per week. 30 perhaps? Maybe more. But it is certainly not 10,000. So if an algorithm is supposed to learn how your interests change from week to week, this algorithm has very little data to learn from.
There are many other market opportunities for small data. For example, Peter Sweeney’s article mentions social interest networks, local ecommerce, and hyper-targeting in marketing and advertising. You can probably add to that list personalized medicine, materials discovery, robotics, and in some cases machine translation as well. I will come back to some of these topics below, when we look at concrete examples.
This is the “opportunity part”. But there’s also a “no other choice part”.
You may have no other choice than working with small data
Not enough resources for data labeling
Google can afford to hire thousands of people for data labeling. So can China, for example. But what if you can’t?
Privacy and data silos
Privacy regulations and concerns may prevent you from collecting and storing large amounts of data. For example, in medical or legal applications, you can’t simply access and store as much data as you want about patients.
Data silos arise when you do have lots of data, but the data are distributed and secluded across different repositories. Data in companies or other organizations are often siloed, sometimes for good reasons such as security and privacy, sometimes for less good reasons such as office politics. In any case, these data silos mean that rather than learning from one big data set, you have to learn from many small data sets–which is quite different.
Insufficient access to powerful machine learning hardware
For training big, parameter-rich machine learning models, you need powerful computing resources. Similar to data labeling, big organizations can afford such resources, but what if you can’t? Not to mention the current chip shortage. This might make it difficult even for large organizations to obtain the hardware resources that are required for training large-scale machine learning models.
How small is small data?
There is no universally agreed-upon definition of the term “small data”–particularly not of the “small” part. The concrete number for “small” depends on the context. There might be contexts where it’s perfectly doable to get a set of 5,000 labeled data points. But other contexts might make it impossible to get more than 30. However, even if the exact definition of “small” may vary, it’s certainly not 100,000 or more data points.
And while the term “small data” may be a bit fuzzy, there is a more-clearly-defined type of machine learning called one-shot learning, or “few-shot learning”. As the name implies, one-shot learning aims to learn from one or a few labeled data points. Of course, there is no free lunch. One-shot learning does not come from nothing. Rather, one-shot learning uses existing prior knowledge to interpret new, previously unseen objects. And this existing prior knowledge has to come from somewhere.
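To make the “prior knowledge” point concrete, here is a minimal sketch of one common few-shot technique, nearest-prototype classification. It is an illustration under stated assumptions, not how any particular system works: the `embed` function is a hypothetical stand-in for prior knowledge (in practice, a model pretrained on lots of other data), and the hand-made features and example values are invented for the sketch.

```python
import math

def embed(item):
    """Hypothetical stand-in for prior knowledge: map an item to a feature vector.

    A real system would use a model pretrained on lots of other data.
    Here an item is just a dict of hand-made numeric features.
    """
    return (item["legs"], item["barks"], item["weight_kg"] / 10.0)

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def build_prototypes(support_set):
    """Average the embeddings of the few labeled examples available per class."""
    by_label = {}
    for item, label in support_set:
        by_label.setdefault(label, []).append(embed(item))
    return {label: mean_vector(vecs) for label, vecs in by_label.items()}

def classify(item, prototypes):
    """Return the label whose prototype is closest to the item's embedding."""
    vec = embed(item)
    return min(prototypes, key=lambda label: math.dist(vec, prototypes[label]))

# Only two labeled examples per class: the "few shots".
support_set = [
    ({"legs": 4, "barks": 1, "weight_kg": 30}, "dog"),
    ({"legs": 4, "barks": 1, "weight_kg": 8}, "dog"),
    ({"legs": 2, "barks": 0, "weight_kg": 1}, "not a dog"),
    ({"legs": 4, "barks": 0, "weight_kg": 4}, "not a dog"),
]
prototypes = build_prototypes(support_set)

# A previously unseen item is interpreted via the embedding, not via more labels.
prediction = classify({"legs": 4, "barks": 1, "weight_kg": 15}, prototypes)
print(prediction)
```

All the heavy lifting happens in `embed`. With a poor embedding, four labeled examples would not get you anywhere, which is the no-free-lunch point in code form.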
Humans are one-shot learners
Human babies and small children–or humans generally–can learn from very small data sets (OK, “please clean up your room” doesn’t seem to fall into this category).
If you want to dig deeper on research in this area, I recommend watching the video below. It’s a talk by Josh Tenenbaum, who “is fascinated with the question of how human beings learn so much from so little, and how these insights can lead to build AI systems that are much more efficient at learning from data” (from Lex Fridman’s intro to Josh’s talk).
Current machine learning models don’t learn like humans
My focus here is not on discussing the difference between how humans vs. computers learn. But just consider the following aspect:
Currently (Sept 2021), one of the most advanced machine learning models is GPT-3. GPT-3 is a language model that was built to solve few-shot tasks. Such tasks include, for example, translating words and phrases that the model has never seen. In order to do this, GPT-3 used lots of data from one domain for training. Then the idea is to transfer what it learned from the large data set to never-before-seen tasks with little or no training data (remember my no-free-lunch remark above).
This is fascinating work. And if you’ve ever used “AI-powered writing companions” like Wordtune, you have seen the power of such approaches (Wordtune uses a model built by AI21 Labs, not GPT-3).
But let’s look at power consumption. According to this estimate, GPT-3 used 190,000 kWh of electricity for training. Where I live, the average price of 1 kWh is 32 euro cents (about 38 US cents). At this price, training GPT-3 would have cost me roughly EUR 60,000 (USD 70,000) in electricity. I just checked, I could also buy a Tesla Model Y for that money…
By contrast, the human brain runs on 20 watts according to this estimate, or 12 watts according to this other estimate. And unlike GPT-3, on these 12-20 watts, the human brain does many other tasks too, not “just” language. This means that, if we stick to the 20 watts figure, with GPT-3’s 190,000 kWh, we could power a human brain for…
190,000,000 Wh / 20 W = 9,500,000 hours ≈ 395,833 days ≈ 1,084 years.
In other words, there is room for improvement (even if our estimates are off by a factor of 10 or 20).
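If you want to replay this back-of-the-envelope calculation with your own numbers, a small sketch like the following might help. The energy figure, prices, and wattages are the ones quoted above; the function and variable names are just my choices for illustration.

```python
TRAINING_ENERGY_KWH = 190_000  # GPT-3 training estimate quoted above
PRICE_EUR_PER_KWH = 0.32       # local electricity price quoted above
PRICE_USD_PER_KWH = 0.38

def brain_runtime(energy_kwh, brain_watts):
    """Return (hours, days, years) a brain drawing brain_watts could run on energy_kwh."""
    hours = energy_kwh * 1000 / brain_watts  # kWh -> Wh, then divide by W
    days = hours / 24
    years = days / 365  # ignores leap years
    return hours, days, years

cost_eur = TRAINING_ENERGY_KWH * PRICE_EUR_PER_KWH
cost_usd = TRAINING_ENERGY_KWH * PRICE_USD_PER_KWH
print(f"Electricity cost: EUR {cost_eur:,.0f} / USD {cost_usd:,.0f}")

# Both brain-power estimates mentioned above: 20 W and 12 W.
for watts in (20, 12):
    hours, days, years = brain_runtime(TRAINING_ENERGY_KWH, watts)
    print(f"{watts} W: {hours:,.0f} hours = {days:,.0f} days = {years:,.0f} years")
```

At 20 watts this reproduces the figure of about 1,084 years. With the 12 watts estimate, the same energy would last roughly 1,807 years.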
Next, let’s look at who some of the players are that work on small data.
Small data companies and R&D
I used Mergeflow’s tech discovery software to look for companies and R&D across various applications. Here are some of my findings:
Small data companies
I focused on venture-backed companies. I ran a simple search, using the terms “small data”, “one-shot learning”, “few-shot classification”, “few-shot learning”, and “low-data” as my search query. Here’s a screenshot of my results:
yellow.ai is a customer experience automation platform. For their chatbots and voice bots, they use a proprietary NLP (natural language processing) model that’s based on a few-shot approach. According to their website, this approach enables users to go live with their conversational agent in 10 days. So while I could not find further technical details, this sounds as if they use a model like GPT-3.
TechSee is an “intelligent visual assistant”. You can use TechSee to provide support for field technicians, for example. Similar to what yellow.ai does for language, TechSee enables users to build models for new devices in a few hours, reducing the number of training examples from several thousand to just a few.
Docugami is a document engineering company. Their software enables you to manage and create documents in an enterprise setting. For example, if you want to draft an NDA, Docugami helps you assemble and compose the components–from across your company’s document repositories–that need to go into your NDA. This addresses the “data silos” use case that I mentioned above.
Primal is building a text analytics platform that does not require large amounts of training data. Applications are in legal firms and financial services, for example. The company was founded by Peter Sweeney (who wrote the article I mentioned above, Where big data fails… and why).
Element AI is a platform for building AI solutions. One of their initial use cases was enabling forecasting models for small data sets (no further details provided, however). In 2017, they raised a massive $102 million Series A funding round, and the company has since been acquired by ServiceNow.
R&D on one-shot learning
Please note that the following is by no means a comprehensive analysis of small data R&D activities. Consider it “food for thought”, or a jumping-off point for further exploration.
I already mentioned above that there is no clear-cut, universally-agreed-upon definition of “small data”. But there are types of machine learning that fall under the category. One of these types is transfer learning. In transfer learning, you use a model trained with (lots of) data from one domain, and transfer it to a different but related problem (raising the very expensive question, “What does ‘related’ mean exactly?”).
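Here is a minimal sketch of the basic transfer learning recipe: keep a pretrained feature extractor frozen, and train only a small “head” on the few labeled examples you have for the new task. Everything in it is an assumption for illustration. In particular, `pretrained_features` is a hypothetical, hard-coded stand-in for a model trained on a large source data set, and the toy task (is the first number bigger than the second?) is invented for the sketch.

```python
import math

def pretrained_features(x):
    """Hypothetical frozen feature extractor (stand-in for a big pretrained model).

    It is never updated below; only the small head on top of it is trained.
    """
    return (x[0] + x[1], x[0] - x[1])

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(examples, epochs=1000, lr=0.1):
    """Fit a logistic-regression head on the frozen features with plain SGD."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for x, y in examples:
            feats = pretrained_features(x)
            p = sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)
            error = p - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
            bias -= lr * error
    return weights, bias

def predict(x, weights, bias):
    """Return 1 or 0 for input x, using frozen features plus the trained head."""
    feats = pretrained_features(x)
    z = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1 if z >= 0 else 0

# The "small data" of the target task: just four labeled examples.
target_examples = [
    ((2.0, 1.0), 1),
    ((3.0, 0.5), 1),
    ((1.0, 2.0), 0),
    ((0.5, 3.0), 0),
]
weights, bias = train_head(target_examples)
print(predict((2.5, 0.5), weights, bias), predict((0.5, 2.5), weights, bias))
```

The sketch only works because the frozen features happen to be useful for the new task. That is the “related” question in practice: if the source and target problems are not related enough, the transferred features will not help.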
As mentioned in the CSET Issue Brief on small data, transfer learning is on the rise. The screenshot from Mergeflow below shows you how much it’s been rising over the past ten years:
However, here I chose to focus on R&D in small data other than transfer learning. One-shot learning, for example. The screenshot below shows a preview of the science publications results:
As you can see, the “publications over time” upward trend is very similar to transfer learning. But the absolute numbers are a lot lower (max. 80/month vs. max. 250/month).
In addition, the screenshot above shows a tag cloud of emerging technologies. Mergeflow identifies these and other emerging technologies in the contents it ingests. “Bigger font” means “more documents” in the tag cloud (not surprisingly, “deep learning” and “machine learning” were biggest because one-shot learning is a form of machine learning).
Next, I only considered publications from the past year. I used the “emerging technology” tag cloud (cf. screenshot below) to zoom in on some applications of “small data R&D”:
Here are some of my findings (again, this is intended to be exemplary, not exhaustive):
Computational materials discovery
The purpose of using computational methods for materials discovery is to conserve resources and to make the discovery process more scalable. Running physical experiments in the lab is expensive and time-consuming; computational methods much less so. This implies that training data (from physical experiments) are hard to come by. In other words, small-data approaches should be very attractive for computational materials discovery.
This field is still relatively young, but here is a paper that addresses the data availability issue in computational chemistry head-on:
Chemical language models enable navigation in sparsely populated chemical space
Neuroimaging
Neuroimaging studies the nervous system (brain, eyes, spine, nerves) using various imaging methods. When you use machine learning for neuroimaging (for example, to learn how to recognize brain structures from images), you have to deal with the fact that every person is different. Not to mention data privacy, which is a concern in medical applications. So you usually don’t have one large data set but many smaller ones.
The following paper investigated few-shot methods applied to brain activity data. “Decoding” here basically means “learning how to tell the cognitive state of a person by looking at their brain activity”:
Few-shot Decoding of Brain Activation Maps
Robotics
Imagine you want to train a robot how to recognize or handle certain kinds of objects. Shoes, for example. Of course, not all shoes look the same. But wouldn’t it be great if the robot could somehow generalize from handling your running shoes to handling your dress shoes?
This is where few-shot learning comes in. And the paper below describes a database for few-shot-teaching machines (such as robots) how to recognize various objects across various conditions (lighting conditions, for instance):
ORBIT: A Real-World Few-Shot Dataset for Teachable Object Recognition
Two recommendations for how to deal with small data, or machine learning more generally
1. Only start building machine learning systems once you know you have the prerequisites
If you are involved in building machine learning systems, before you start building anything, take a hard look at your situation and your use cases (make sure you have use cases; without use cases, project failure rate is 100%). Ask yourself the following questions:
- Do you have the training data and the compute resources required for building large-scale, parameter-rich models?
- Is your goal to build a model that will more or less do the same thing over and over again?
If your answer to one or both of these questions is “no”, you should seriously consider a “small data” approach.
2. Fund the solution of a problem, not the approach
If you are involved in making funding decisions, I recommend funding the solution to a problem, not the approach that this solution should take. As a positive example, consider Wellcome Leap’s Delta Tissue program. This is a $55 million human health research program that aims to build a platform for “predicting changes in tissue state”. Yes, the program makes reference to “recent advances in machine learning”, and how they might help advance the goals of the program. But the goals of the program talk about solutions, not methods. For example, one goal is to “improve sample processing time by 5-10x”. Nowhere does this say “machine learning”, let alone “big data”.
Why is funding the solution and not the approach important?
If you fund the approach–as many other programs, startup incubators etc. do, unfortunately–people will stick to the approach, no matter if it works or not. In extreme cases, this might even encourage manipulation (because what do you do if the approach turns out not to work? Take a different approach? No, you can’t, because then your funding will stop).
By contrast, if you fund the solution, you give people the freedom to try out different approaches and select the one(s) that work(s). And then, even if you get only part of the way toward your goal, you’ll be making real progress.