Training Data Shortages

A common view holds that the supply of human-generated data for training LLMs has been exhausted. All of the accessible digital data have already been incorporated into training, and any additional data must come from synthetic sources.

This may be correct with respect to digital data. But accessible digital data represent only the small fraction of human experience that has been captured digitally. Very little of your experience or knowledge is captured digitally, no matter how online you are.

Even if you record video and sound of your full day, you miss the senses of smell and touch. And yet an all-day recording of sound, let alone video, will produce a vastly richer data set than what exists today. Small wonder that companies have begun making devices that aim to capture that sound.

And what about non-sensory data? Will activity- or health-monitor data become part of the training data set?

How valuable will that added data be in training LLMs? I don’t know. It’s possible that the broad sea of new data has little training value. We’ll likely learn the answer as the recording devices roll out. My guess is that it will at least make the models more adaptive to individual personality. Can we finally rid ourselves of the obsequious responses?

After recording of sound, and maybe video, becomes ubiquitous, attention can turn to the missing senses of smell and touch, and then to thoughts. The head chips of the future can serve first to record and digitize these channels, before they turn us into cyborgs.

There is no shortage of additional human-generated data. There is a narrow capture of it. The new recording devices will dramatically increase the breadth of capture, and expand the training data set. Right now we depend largely on individuals to digitize their own data. Not for long.