In “traditional neural nets” such as backpropagation MLPs or RBF networks with 1-3 hidden layers, I have in the past used a guideline for good generalization: have roughly 10 times as many training records as free parameters (i.e., neural net weights).
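For concreteness, this is the kind of calculation I mean by the 10x guideline (the layer widths here are hypothetical example values, not from any real application):

```python
# Count free parameters (weights + biases) of a fully connected net,
# then require ~10x that many training records (the rule of thumb above).

def count_parameters(layer_sizes):
    """Weights + biases for a fully connected net with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

layers = [20, 15, 10, 1]           # input, two hidden layers, output (hypothetical)
params = count_parameters(layers)  # 20*15+15 + 15*10+10 + 10*1+1 = 486
min_records = 10 * params          # guideline: at least 4860 training records
print(params, min_records)
```

My question is essentially whether any analogous back-of-the-envelope estimate exists for deep nets.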
From what I have seen in deep learning papers, DL can make good use of huge amounts of data, but how well does it work with smaller amounts? What problem size (in terms of the number of training records) is too small for DL? I understand that convolutional NNs have many duplicated weights, tied to a single update through the shifting convolution. I am also aware of dropping out a percentage of a fully connected layer. Does ReLU help further with generalization? My applications of interest are not CNNs.
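To be precise about the dropout I have in mind, here is a minimal sketch of (inverted) dropout applied to a fully connected layer's activations, written with numpy rather than any particular DL framework:

```python
import numpy as np

def dropout(activations, p_drop, rng):
    """Inverted dropout: zero a fraction p_drop of units at train time,
    scaling the survivors by 1/(1 - p_drop) so the expected activation
    is unchanged (no rescaling needed at test time)."""
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
a = np.ones(1000)            # toy activations
out = dropout(a, 0.5, rng)   # about half are zeroed, the rest become 2.0
```

The question is whether this kind of regularization meaningfully lowers the effective number of free parameters for purposes of a records-per-weight guideline.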
Any links to good related reading?
ImageNet competitions have 1000 image categories, but may have only a few hundred examples per target category. I assume much of the generalization comes from the shared lower-level feature extraction.
One intuition I have (open for discussion) is that maybe I should look at the number of weights connecting two consecutive layers, especially if training autoencoder-style, but I don’t know whether that would carry over to non-autoencoder DL net configurations (for estimating generalization and the minimum number of training records).
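A sketch of that intuition, where the widths and the choice to apply the 10x rule to the largest consecutive-layer weight count are my own untested assumptions:

```python
# Hypothetical autoencoder widths: 100 -> 50 -> 20 -> 50 -> 100.
# Intuition: size the training set against the largest layer-pair
# weight count rather than the total parameter count.

def layer_weight_counts(layer_sizes):
    """Number of weights (excluding biases) between each consecutive layer pair."""
    return [n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

widths = [100, 50, 20, 50, 100]
counts = layer_weight_counts(widths)  # [5000, 1000, 1000, 5000]
min_records = 10 * max(counts)        # 50000 under this (untested) intuition
print(counts, min_records)
```

Whether the per-layer count or the total count is the right quantity is exactly what I am unsure about.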