Why do current models use multiple normalization layers?

In most current models, a normalization layer is applied after each convolution layer; many architectures repeat the block Conv->BatchNorm->ReLU. But why do we need multiple BatchNorm layers? If a Conv layer receives a normalized input, shouldn't it spit out a normalized output? Isn't it enough to place normalization layers only at the beginning…
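A minimal PyTorch sketch of the pattern being asked about; the channel sizes and input shape are arbitrary assumptions chosen only to show the repeated Conv->BatchNorm->ReLU block:

```python
import torch
import torch.nn as nn

# A typical stack that repeats Conv -> BatchNorm -> ReLU.
# Channel sizes here are illustrative, not from the question.
def conv_bn_relu(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),   # one BatchNorm after every convolution
        nn.ReLU(inplace=True),
    )

model = nn.Sequential(
    conv_bn_relu(3, 32),
    conv_bn_relu(32, 64),
    conv_bn_relu(64, 128),
)

x = torch.randn(8, 3, 32, 32)
print(model(x).shape)  # torch.Size([8, 128, 32, 32])
```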

Should I use the mean or a sampled value for action selection in the PPO algorithm?

In the PPO algorithm, the policy network outputs the parameters of a Gaussian distribution (for a continuous action-space problem). When I use this network to obtain an action, should I sample from the fitted distribution or directly take the mean output of the policy network? Also, what should be the…
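A small sketch, assuming a PyTorch Gaussian policy head with a state-independent log standard deviation (the architecture and sizes are illustrative assumptions), showing both the sampled action and the mean action the question contrasts:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

# Illustrative Gaussian policy for a continuous action space.
# The network layout and the state-independent log_std are assumptions,
# not something specified in the question.
class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                  nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs, deterministic=False):
        mean = self.body(obs)
        dist = Normal(mean, self.log_std.exp())
        # Stochastic action (typical during training / data collection)
        # vs. the mean action (often used for deterministic evaluation).
        action = mean if deterministic else dist.sample()
        return action, dist.log_prob(action).sum(-1)

policy = GaussianPolicy(obs_dim=4, act_dim=2)
obs = torch.randn(1, 4)
a_sampled, logp = policy(obs)                 # sample from the distribution
a_mean, _ = policy(obs, deterministic=True)   # take the mean directly
```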

What would be the implications of mistakenly adding bias after the activation function?

I was looking at the source code of a neural-network implementation for a personal project, and the bias for each node was mistakenly applied after the activation function. The output of each node was therefore $\sigma\big(\sum_{i=1}^n w_i x_i\big)+b$ instead of $\sigma\big(\sum_{i=1}^n w_i x_i + b\big)$. Assuming standard back-propagation algorithms are used for training, what (if any)…
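A minimal sketch of the difference, assuming a logistic-sigmoid activation and arbitrary example weights (none of these values come from the project in question):

```python
import numpy as np

# Toy forward pass contrasting the two placements of the bias.
def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.2, 0.3])
x = np.array([1.0, 0.4, 2.0])
b = 0.7

correct = sigma(w @ x + b)      # bias inside the activation (standard form)
mistaken = sigma(w @ x) + b     # bias added after the activation (the bug)

print(correct, mistaken)
# With the mistaken form the node's output is no longer bounded to (0, 1),
# since b simply shifts the sigmoid's output.
```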

Tabular datasets where deep neural networks outperform XGBoost

Are there (complex) tabular datasets where deep neural networks (e.g. more than 3 layers) outperform traditional methods such as XGBoost by a large margin? I'd prefer tabular datasets rather than image datasets, since most image datasets are either so simple that even XGBoost performs well (e.g. MNIST), or so difficult for XGBoost that its…
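A rough comparison harness one might use to check a candidate dataset, assuming scikit-learn and the xgboost package are installed; the dataset and hyperparameters below are placeholders, not a dataset where the network is known to win by a large margin:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Load a stand-in tabular dataset and hold out a test split.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Gradient-boosted trees baseline.
xgb = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)

# A deeper-than-3-layer MLP, with feature scaling (which the MLP needs).
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(128, 128, 128, 128), max_iter=500,
                  random_state=0),
)

xgb.fit(X_tr, y_tr)
mlp.fit(X_tr, y_tr)

print("XGBoost accuracy:", xgb.score(X_te, y_te))
print("MLP accuracy:    ", mlp.score(X_te, y_te))
```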