
The way I interpret the lottery ticket hypothesis is that you don't actually need the full-sized networks (with their structure and parameters) in order to perform well at some tasks (when comparing performance against larger networks). I think everyone agrees that most neural networks are highly overparameterized, as successful distillation efforts have shown.
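
(Side note, since I'm leaning on distillation as evidence: by distillation I mean the usual teacher/student setup from Hinton et al., i.e. training a small student on the teacher's softened outputs. A rough sketch of that loss, with the temperature and weighting picked arbitrarily for illustration:)

    import torch
    import torch.nn.functional as F

    # Rough sketch of the standard soft-target distillation loss; T and alpha
    # here are arbitrary illustrative choices, not values from any paper.
    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)                                  # rescale so gradients are comparable
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard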

However, this doesn't directly make my point about the sample efficiency of today's algorithms compared to humans any less valid. What I'll give you is that with smaller networks the required sample size is expected to shrink (due to the curse of dimensionality). On the other hand, expressiveness is clearly harmed by the reduced parameter count/altered network structure, which possibly reduces the network's ability to perform well on certain tasks.

I think it's important to clearly make a distinction between the required amount of computation and the number of data samples that are necessary when talking about scaling up existing methods. Compute is "cheap", while data isn't.

As a side note, I think the usefulness of the lottery ticket hypothesis is mostly about the ability of a random initialization to already give a hint about the quality of the 'prior' that is encoded by the network structure. That makes it useful for less computationally intensive architecture search, as the papers (and a paper by Andrew Ng on this topic) also suggest.



> The way I interpret the lottery ticket hypothesis is that you don't actually need the full-sized networks (with their structure and parameters) in order to perform well at some tasks

Actually that's not the point. Pruning typically results in networks that still perform well but are harder to train. The idea is to explicitly search for a subnetwork (via pruning) that is easy to train.
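
For concreteness, here is a minimal sketch of the iterative magnitude pruning loop as I understand it (my paraphrase with a toy model and stand-in data, not the authors' code): train, prune the smallest-magnitude surviving weights, rewind the survivors to their initial values, and repeat.

    import copy
    import torch
    import torch.nn as nn

    # Lottery-ticket style iterative magnitude pruning (sketch, not the paper's code).
    model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
    init_state = copy.deepcopy(model.state_dict())           # remember the initialization
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}

    def train(model, masks, steps):
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        for _ in range(steps):
            x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))   # stand-in data
            loss = nn.functional.cross_entropy(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
            with torch.no_grad():                             # keep pruned weights at zero
                for n, p in model.named_parameters():
                    if n in masks:
                        p.mul_(masks[n])

    for _ in range(3):                                        # a few prune/rewind rounds
        train(model, masks, steps=100)
        with torch.no_grad():
            for n, p in model.named_parameters():
                if n in masks:
                    alive = p[masks[n].bool()].abs()
                    masks[n] = masks[n] * (p.abs() > alive.quantile(0.2)).float()
        model.load_state_dict(init_state)                     # rewind to the original init
        with torch.no_grad():
            for n, p in model.named_parameters():
                if n in masks:
                    p.mul_(masks[n])                          # keep only the "winning ticket"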

> What I'll give you is that with smaller networks the required sample size is expected to shrink (due to the curse of dimensionality).

I'm not so sure about that either. From (https://arxiv.org/abs/2001.08361):

> Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
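
For reference, and from memory (so check the paper for the exact parameterization and fitted constants), the joint scaling law they fit is roughly of the form

    L(N, D) ≈ [ (N_c / N)^(α_N / α_D) + D_c / D ]^(α_D)

where N is the parameter count and D is the dataset size; at fixed D, increasing N still lowers the loss, which is the sense in which bigger models are more sample-efficient.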

My second and third links are also important! The second talks about generalizing "winning" tickets across other datasets and optimizers. The third talks about weight agnostic neural networks, which in a nutshell are still capable of more-or-less performing a task even with _randomized_ weights.
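
To make the weight-agnostic idea concrete, here's a toy version of the evaluation (my own sketch, not the paper's code): fix a connectivity pattern, force every connection to share a single weight value, and average performance over several such values.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy weight-agnostic evaluation: the "architecture" is a fixed binary
    # connectivity pattern; every existing connection shares one scalar weight.
    MASK1 = rng.integers(0, 2, size=(2, 8)).astype(float)   # input -> hidden connections
    MASK2 = rng.integers(0, 2, size=(8, 1)).astype(float)   # hidden -> output connections

    def forward(x, shared_w):
        h = np.tanh(x @ (shared_w * MASK1))
        return h @ (shared_w * MASK2)

    def weight_agnostic_score(xs, ys, weight_values=(-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)):
        # A good topology performs tolerably for *all* shared weight values,
        # i.e. the structure itself encodes most of the solution.
        return np.mean([np.mean((forward(xs, w) - ys) ** 2) for w in weight_values])

    xs = rng.standard_normal((256, 2))
    ys = (np.sign(xs[:, :1]) == np.sign(xs[:, 1:])).astype(float)   # toy task
    print(weight_agnostic_score(xs, ys))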

Weight-agnostic networks have a lot of parallels to wildlife that is capable of certain behaviors required for survival effectively immediately, before there's been a chance for significant learning to take place. This is the counterpoint I was referring to: an equivalent phenomenon could explain (at least partially) why humans require so much less data when learning.


> Actually that's not the point. Pruning typically results in networks that still perform well but are harder to train. The idea is to explicitly search for a subnetwork (via pruning) that is easy to train.

They state "smaller network, same test accuracy, with similar number of iterations". So it seems the original network size wasn't necessary for the best test accuracy, and the compute requirement is reduced only because the network is smaller. Sample efficiency isn't increased, according to https://arxiv.org/abs/1803.03635.

Good performance with random weights seems to indicate good 'priors' encoded in the network, like how convolutional networks encode the prior of translational invariance and are hence naturally good performers on image inputs/tasks.
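
As a concrete (if slightly pedantic) illustration: a convolution layer is translation-equivariant, and invariance then typically comes from pooling over those equivariant features. A quick numpy check, away from the boundary:

    import numpy as np

    # Check that convolution commutes with translation (equivariance),
    # which is the structural prior I mean; pooling on top gives invariance.
    x = np.random.randn(32)          # 1-D "image"
    k = np.random.randn(5)           # 1-D filter

    y = np.convolve(x, k, mode="valid")
    y_shifted_input = np.convolve(np.roll(x, 3), k, mode="valid")

    # Shifting the input just shifts the feature map (ignoring the wrapped border).
    print(np.allclose(y_shifted_input[3:], y[:len(y) - 3]))   # True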

I think the parallel to "wildlife that is capable of certain behaviors ... before there's been a chance for significant learning to take place" is that priors are also part of biological intelligence, i.e. brain structure at birth enabling certain survival-oriented behaviors.

Hence, I'm optimistic about transfer learning, which could happen through _both_ better models (priors that generalize well) and pretrained weights (possibly partially pretrained, i.e. just the initial feature extraction). Either could potentially provide a better starting point from the 'how many samples are necessary for good performance on a variety of tasks' perspective.
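
The common recipe I have in mind looks roughly like this (a PyTorch/torchvision sketch with stand-in data; the exact `weights` argument depends on your torchvision version, older ones use `pretrained=True`):

    import torch
    import torch.nn as nn
    from torchvision import models

    # Transfer-learning sketch: reuse a pretrained backbone as a fixed feature
    # extractor and only train a small task-specific head on few samples.
    backbone = models.resnet18(weights="IMAGENET1K_V1")    # pretrained "prior"
    for p in backbone.parameters():
        p.requires_grad = False                            # freeze the feature extractor

    backbone.fc = nn.Linear(backbone.fc.in_features, 5)    # new head for a 5-class task

    optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # Toy training step; in practice this would loop over a small labeled set.
    x, y = torch.randn(8, 3, 224, 224), torch.randint(0, 5, (8,))
    loss = loss_fn(backbone(x), y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()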

The point is that, either way, information needs to be added for performance on tasks to increase. Doing that in a task-specific way, using today's algorithms and a billion samples, doesn't seem like the right approach. Finding algorithms, models, or more specifically neural network architectures (including training procedures, regularizers, loss functions, weight tying) that generalize across tasks without needing many samples, thanks to their informative priors, seems like the way forward to me. That's _not_ a naive scaling of today's algorithms to larger and larger training sets, which was the point I was trying to make.



