“The scaling hypothesis” is the name given to the idea that existing algorithms might be all we need if we just throw more compute at them. Certainly GPT-3 is a very interesting data point here. However, we definitely also need better algorithms. It’s a mix of scaling and new algorithms that will get us to AGI.
I think that 'just' scaling today's algorithms is quite a naive approach, as it would imply the need for huge amounts of training samples for simple tasks (simple to humans). Given that humans tend to need an order of magnitude fewer samples before being able to generalize, I think we need more than just scaled-up versions of today's NNs, SVMs, trees, what-have-you.
But a human isn't trained from scratch. Babies go through huge amounts of unsupervised learning to build up a basic vision and language framework.
Training a neural network to recognize dog pictures is like connecting electrodes to your tongue and trying to do the same. Rudimentary "vision" (very low resolution) has actually been demonstrated this way in human experiments, but you definitely need more than a few examples.
A fairer comparison is: can giant pretrained NNs learn to generalize from few examples? The answer seems to be yes.
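For example, the GPT-3-style few-shot setup just conditions the pretrained model on a handful of examples at inference time, with no gradient updates at all (toy prompt only, the actual model call is omitted):

```python
# A few-shot prompt: the pretrained model is shown a couple of examples in its
# context window and asked to continue the pattern, without any fine-tuning.
prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)
# completion = some_pretrained_language_model(prompt)  # placeholder, not a real API
```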
I agree that there are clearly algorithmic improvements remaining to be made. However, a counterpoint to your specific example would be the lottery ticket hypothesis and related weight agnostic neural networks.
The way I interpret the lottery ticket hypothesis is that you don't actually need the full sized networks (with their structure and parameters) in order to perform well at some tasks (when comparing performance against larger networks). I think everyone agrees that most neural networks are highly overparameterized, as successful distillation efforts have shown.
However, this doesn't directly make my point about the sample efficiency of today's algorithms compared to humans any less valid. What I'll give you is that with smaller networks the required sample size is expected to shrink (due to the curse of dimensionality). On the other hand, the expressiveness is clearly harmed by the reduced parameter count/altered network structure, which possibly reduces the ability of the network to perform well for certain tasks.
I think it's important to clearly make a distinction between the required amount of computation and the number of data samples that are necessary when talking about scaling up existing methods. Compute is "cheap", while data isn't.
As a side note, I think the usefulness of the lottery ticket hypothesis is mostly about the ability of random initialization to already give a hint about the quality of the 'prior' that is encoded by the network structure. That makes it useful for less computationally intensive architecture search, as also suggested by the papers and by a paper by Andrew Ng on this topic.
> The way I interpret the lottery ticket hypothesis is that you don't actually need the full sized networks (with their structure and parameters) in order to perform well at some tasks
Actually that's not the point. Pruning typically results in networks that still perform well but are harder to train. The idea is to explicitly search for a subnetwork (via pruning) that is easy to train.
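To make that concrete, here's a rough sketch of the procedure as I understand it (iterative magnitude pruning with rewinding to the original initialization; `train` is a placeholder for your usual training loop, which should keep masked-out weights at zero):

```python
import copy
import torch

def find_winning_ticket(model, train, prune_frac=0.2, rounds=5):
    """Sketch of iterative magnitude pruning with rewinding.

    `train(model, masks)` is a placeholder: it should train `model` while
    keeping masked-out weights at zero, and return the trained model.
    """
    init_state = copy.deepcopy(model.state_dict())       # remember the original init
    masks = {name: torch.ones_like(p)                    # start with nothing pruned
             for name, p in model.named_parameters() if p.dim() > 1}

    for _ in range(rounds):
        model = train(model, masks)                       # train to (near) convergence
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name not in masks:
                    continue
                surviving = masks[name].bool()
                magnitudes = param.abs()[surviving]       # only consider unpruned weights
                k = int(prune_frac * magnitudes.numel())
                if k > 0:
                    threshold = torch.kthvalue(magnitudes, k).values
                    masks[name][param.abs() <= threshold] = 0.0   # prune smallest weights
        model.load_state_dict(init_state)                 # rewind survivors to their init

    return model, masks                                   # the "winning ticket": init + mask
```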
> What I'll give you is that with smaller networks the required sample size is expected to shrink (due to the curse of dimensionality).
The scaling-laws paper (https://arxiv.org/abs/2001.08361) reports the opposite, though:
> Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
My second and third links are also important! The second talks about generalizing "winning" tickets across other datasets and optimizers. The third talks about weight agnostic neural networks, which in a nutshell are still capable of more-or-less performing a task even with _randomized_ weights.
Weight agnostic networks have a lot of parallels to wildlife that is capable of certain behaviors required for survival effectively immediately, before there's been a chance for significant learning to take place. This is the counterpoint I was referring to - an equivalent phenomenon could explain (at least partially) why humans require so much less data when learning.
> Actually that's not the point. Pruning typically results in networks that still perform well but are harder to train. The idea is to explicitly search for a subnetwork (via pruning) that is easy to train.
They state "smaller network, same test accuracy, with similar number of iterations". So it seems the original network size wasn't necessary for the best test accuracy, and the compute requirement is reduced only because it's a smaller network. Sample efficiency isn't increased, according to https://arxiv.org/abs/1803.03635.
Good performance with random weights seems to indicate good 'priors' encoded in the network structure, like how convolutional networks encode the prior of translational invariance and hence are naturally good performers on image inputs/tasks.
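A cheap way to see this: freeze a conv stack with completely random weights and train only a linear readout on top; on small image tasks that already tends to beat chance by a wide margin. Rough sketch (shapes assume 3x32x32 inputs, e.g. CIFAR-10; untested):

```python
import torch
import torch.nn as nn

# Random, frozen convolutional "prior": only the linear readout gets trained.
features = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
)
for p in features.parameters():
    p.requires_grad = False                      # keep the random weights fixed

readout = nn.Linear(64 * 8 * 8, 10)              # 10 classes, 32x32 inputs assumed
optimizer = torch.optim.Adam(readout.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    with torch.no_grad():
        feats = features(images)                 # random conv features, no gradient
    logits = readout(feats)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```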
I think the parallel to "wildlife that is capable of certain behaviors ... before there's been a chance for significant learning to take place" is that priors are also part of biological intelligence, i.e. brain structure at birth enabling certain survival-oriented behaviors.
Hence, I'm optimistic about transfer learning, which could happen through _both_ better models (priors that generalize well) and pretrained weights (possibly partially pretrained, i.e. just the initial feature extraction); see the sketch at the end of this comment. Either could potentially provide a better starting point from the 'how many samples are necessary for good performance on a variety of tasks' perspective.
The point is that either way, information needs to be added for performance on tasks to increase. Doing that in a task-specific way by using today's algorithms and a billion samples doesn't seem like the right approach. Finding algorithms, models, or perhaps specifically neural network architectures (including training procedures, regularizers, loss functions, weight tying) that generalize across tasks without needing many samples, due to their informative priors, seems like the way forward to me. That's _not_ a naive scaling of today's algorithms to larger and larger training sets, which was the point I was trying to make.
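As a sketch of the 'pretrained weights as a starting point' part (assuming a torchvision ResNet-18 backbone; data loading and the training loop are omitted):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a backbone pretrained on ImageNet and reuse its features as the "prior".
backbone = models.resnet18(pretrained=True)

# Freeze the pretrained feature extractor...
for p in backbone.parameters():
    p.requires_grad = False

# ...and replace the final classifier with a fresh head for the new task
# (num_classes is whatever the target task needs).
num_classes = 10
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Only the new head's parameters get updated, so far fewer labelled samples
# are needed than when training the whole network from scratch.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
```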
I think you are not appreciating the difference between a "commercial" NN and the human brain. NNs are usually designed for specific tasks that are simply a subset of human capabilities. The human brain is huge, and therefore an equivalent NN would also be huge. Instead, we have lots of small networks, and many of them are even competing and trying to solve the same problem.
You need a lot of samples because you're starting from scratch with each network. If you had one super NN that is as powerful as a bunch of small networks, then you would have a network that can easily generalize, because it can use existing data as a starting point. The amount of existing data that is useful for an unknown task grows with the size of the NN.
An NLP NN for English could be combined with an image recognition NN. Since the NLP NN already has a concept for "cars", it only has to associate its already-learned definition of "car" with images of cars. If you have separate NNs, then you have to teach each of them what a car is, doing the work twice. With small NNs there will always be some redundancy, and that redundancy is a fixed cost.
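This kind of sharing is roughly what joint image-text embedding approaches do: keep the language side's concept vectors fixed and only teach the image network to map into that space. A toy sketch (all the specifics here are made up for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 300   # dimensionality of the (frozen) text embedding space

# Pretend these come from an already-trained language model: one fixed vector per concept.
text_embedding = {"car": torch.randn(EMBED_DIM), "dog": torch.randn(EMBED_DIM)}

# The image network only has to learn a projection *into* that existing space,
# instead of relearning the concept "car" from scratch.
image_encoder = nn.Sequential(          # stand-in for any image feature extractor
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 512), nn.ReLU(),
    nn.Linear(512, EMBED_DIM),
)
optimizer = torch.optim.Adam(image_encoder.parameters(), lr=1e-3)

def train_step(images, labels):
    """labels: list of concept strings, e.g. ["car", "dog", ...]"""
    targets = torch.stack([text_embedding[l] for l in labels])
    preds = image_encoder(images)
    # Pull each image embedding toward the text embedding of its label.
    loss = 1 - F.cosine_similarity(preds, targets, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```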
That's an interesting hypothesis. Is there objective evidence that humans need an order of magnitude fewer samples? One anecdote to possibly challenge you: it takes a toddler several months to learn to walk.
There's actually a paper from early this year that attempts to quantify the scaling characteristics of NLP models (https://arxiv.org/abs/2001.08361).
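The headline result, as I read it, is that test loss falls off as a simple power law in model size (and similarly in data and compute), roughly L(N) = (N_c / N)^alpha_N. A toy illustration of that shape (the constants below are placeholders, not the paper's fitted values):

```python
# Toy illustration of the power-law form reported in the paper:
# loss falls predictably as a power law in parameter count.
ALPHA_N = 0.08     # placeholder exponent
N_C = 1e13         # placeholder scale constant

def predicted_loss(num_params: float) -> float:
    return (N_C / num_params) ** ALPHA_N

for n in (1e6, 1e8, 1e10, 1e12):
    print(f"{n:.0e} params -> predicted loss ~ {predicted_loss(n):.2f}")
```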
Even if scaling alone ends up solving everything (I doubt it), I'd still feel that very significant improvements ought to be possible from an algorithmic perspective. (I realize that's largely baseless, but for some reason I just can't escape the feeling that current algorithms leave a huge amount of potential on the table.)