Neural networks are incredibly good at learning functions. We know from the universal approximation theorem that, in theory, they can take the form of almost any function – and in practice, they seem particularly adept at learning the right parameters. However, something we often have to combat when training neural networks is overfitting – memorizing the training data rather than generalizing to a validation set. The usual remedy for overfitting is simply to add more data, with the rationale that at a certain point the network has little choice but to learn the correct function.
But this never made much sense to me. There is no reason, in terms of loss, for a neural network to prefer a function that generalizes well (i.e. the function you are looking for) over one that does incredibly well on the training data and fails miserably everywhere else. In fact, there is usually a loss advantage to overfitting. And there are infinitely many functions that fit the training data yet succeed on nothing else.
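That loss advantage is easy to see outside of neural networks too. As a rough sketch (a toy polynomial-regression example of my own, not anything specific to neural nets): fitting a noisy sine wave, a high-degree polynomial that can pass through every training point achieves a lower training loss than a modest cubic, while doing far worse on held-out data.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Toy data: a noisy sine wave. 10 training points, many validation points.
x_train = rng.uniform(-3, 3, 10)
y_train = np.sin(x_train) + rng.normal(0, 0.2, 10)
x_val = rng.uniform(-3, 3, 200)
y_val = np.sin(x_val) + rng.normal(0, 0.2, 200)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on data (x, y)."""
    return np.mean((model(x) - y) ** 2)

# A modest degree-3 fit vs. a degree-9 fit that can
# interpolate all 10 training points.
simple = Polynomial.fit(x_train, y_train, 3)
overfit = Polynomial.fit(x_train, y_train, 9)

print(f"degree 3: train={mse(simple, x_train, y_train):.4f} "
      f"val={mse(simple, x_val, y_val):.4f}")
print(f"degree 9: train={mse(overfit, x_train, y_train):.4f} "
      f"val={mse(overfit, x_val, y_val):.4f}")
```

The degree-9 polynomial "wins" on training loss, so a pure loss-minimizing view gives no reason to prefer the cubic – yet the cubic is the one that resembles the function we actually wanted.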
So why is it that neural networks almost always (especially on simpler data) stumble upon the function we want, as opposed to one of the infinitely many other options? Why are neural networks good at generalizing, when there is no incentive for them to be?