### Why is a softmax used rather than dividing each activation by the sum?

Just wondering why a softmax is typically used in practice on the outputs of most neural nets, rather than just summing the activations and dividing each activation by that sum. I know it’s roughly the same thing, but what is the mathematical reasoning behind a softmax over a plain normalization? Is it better in some…
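For illustration, a quick numpy comparison of the two schemes (toy activations, not taken from the question) shows one key difference: plain sum-normalization breaks down whenever activations are negative or sum to roughly zero, while softmax always yields a valid probability distribution.

```python
import numpy as np

def sum_normalize(z):
    # Divide each activation by the sum of activations.
    # Breaks down when activations are negative or the sum is ~0.
    return z / np.sum(z)

def softmax(z):
    # Exponentiate first, so every output is positive and the
    # result is a valid probability distribution for any real z.
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / np.sum(e)

z = np.array([2.0, -1.0, 0.5])
print(sum_normalize(z))  # contains a negative "probability"
print(softmax(z))        # all entries in (0, 1), sums to 1
```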

### The reasoning behind the number of filters in the convolution layer

Let’s assume an extreme case in which the kernel of the convolution layer takes only the values 0 or 1. To capture all possible patterns in an input with $C$ channels, we need $2^{C \cdot K_H \cdot K_W}$ filters, where $(K_H, K_W)$ is the shape of a kernel. So to process a standard RGB image with 3 input channels…
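As a sanity check of that count, plugging a common 3×3 kernel and an RGB input ($C = 3$) into the formula:

```python
# Number of distinct 0/1 kernels for a 3x3 kernel over 3 input channels:
C, K_H, K_W = 3, 3, 3
n_filters = 2 ** (C * K_H * K_W)  # 2^27
print(n_filters)  # 134217728
```

which is why enumerating every possible binary pattern is never the goal in practice.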

### Variable binning for NN

I come from a background of scorecard development using logistic regression. The steps involved there are:

1. binning of continuous variables into intervals (e.g. age can be binned into 10–15 years, 15–20 years, etc.)
2. weight of evidence transformation
3. coarse classing of bins to ensure the event rate has a monotonic relationship with the variable

Variable…
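For concreteness, the first two steps can be sketched in pandas. The data, bin edges, and the ln(%good / %bad) sign convention below are all assumptions for illustration; real scorecard pipelines also handle bins with zero counts in either class.

```python
import numpy as np
import pandas as pd

# Toy data: age and a binary event flag (hypothetical values).
df = pd.DataFrame({
    "age": [12, 13, 14, 17, 18, 19, 22, 23, 24, 27, 28, 33],
    "bad": [1,  1,  0,  1,  0,  0,  1,  0,  0,  1,  0,  0],
})

# Step 1: bin the continuous variable into intervals.
df["age_bin"] = pd.cut(df["age"], bins=[10, 15, 20, 25, 40])

# Step 2: weight of evidence per bin, here ln(%good / %bad)
# (the sign convention varies between shops).
grp = df.groupby("age_bin", observed=True)["bad"].agg(["sum", "count"])
grp["good"] = grp["count"] - grp["sum"]
woe = np.log((grp["good"] / grp["good"].sum()) /
             (grp["sum"] / grp["sum"].sum()))
print(woe)
```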

### Why can neural networks generalize at all?

Neural networks are incredibly good at learning functions. We know by the universal approximation theorem that, theoretically, they can take the form of almost any function – and in practice, they seem particularly adept at learning the right parameters. However, something we often have to combat when training neural networks is overfitting – reproducing the…

### How do AI that play games of incomplete information decide their opening strategy?

This question was inspired by watching AlphaStar play Starcraft 2, but I’m also interested in the concept in general. How does the AI decide what build order to start with? In Starcraft, and many other games, the player must decide what strategy or class of strategies to follow as soon as the game begins. To…

### Has anyone seen this error? If so, how is it fixed?

```python
from keras.layers import Conv2D, UpSampling2D, LeakyReLU, Concatenate, Lambda, Input
from tensorflow.keras import Model
from keras.applications.densenet import DenseNet169

''' Following is to get layers for skip connection and num_filters '''
base_model = DenseNet169(include_top=False, input_shape=(224, 224, 3))
base_model_output_shape = base_model.layers[-1].output.shape
decoder_filters = int(base_model_output_shape[-1] / 2)

def UpProject(array, filters, name, concat_with):
    up_i = UpSampling2D((2, 2), interpolation='bilinear')(array)
    up_i = Concatenate(name=name + '_concat')([up_i, base_model.get_layer(concat_with).output])  # skip connection
    up_i = Conv2D(filters=filters, kernel_size=3, strides=1, padding='same', name=name + '_convA')(up_i)
    up_i = LeakyReLU(alpha=.2)(up_i)
    up_i = Conv2D(filters=filters, kernel_size=3, strides=1, padding='same', name=name + '_convB')(up_i)
    up_i = LeakyReLU(alpha=.2)(up_i)
    return up_i

def get_Model():
    # encoder network…
```

### Which CNN hyper-parameters are most sensitive to centered versus off centered data?

Which hyper-parameters of a convolutional neural network are likely to be the most sensitive to whether the training (and test and inference) data contains only accurately centered images versus off-centered images? More convolutional layers, wider convolution kernels, more dense layers, wider dense layers, more or less pooling, or…? e.g. If I can…

### How difficult is this sound classification?

I want a microphone to pick up sounds around me (let’s say beyond a 3 foot radius), but ignore sounds made at my desk, such as the rustling of paper, clicking a mouse and typing, my hands brushing up on the table, putting a pen down, etc. How hard would it be for AI to…

### Average Reward for Temporal Difference (TD), and how it’s used in Actor-Critic algorithm

In Sutton & Barto’s book (2nd edition), chapter 10 gives the equation for the TD(0) error with average reward: $\delta_t = R_{t+1} - \bar{R} + \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_{t}, \mathbf{w}) \hspace{6em} (10.10)$ Can anyone explain the intuition behind this equation? And how exactly is it derived? Also, chapter 13, section 6 gives the…
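As a rough sketch of how (10.10) is used in an update step (not the book’s pseudocode — the linear features, step sizes, and variable names here are hypothetical), with a linear value approximation $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$:

```python
import numpy as np

# Differential TD(0) error, eq. (10.10):
# delta = R_{t+1} - avg_R + v_hat(S_{t+1}, w) - v_hat(S_t, w)
def td_error(r, avg_r, x_s, x_next, w):
    return r - avg_r + w @ x_next - w @ x_s

w = np.zeros(4)            # value-function weights
avg_r = 0.0                # running average-reward estimate R-bar
alpha, eta = 0.1, 0.01     # step sizes (hypothetical choices)

# One transition: state features, next-state features, reward.
x_s, x_next, r = np.eye(4)[0], np.eye(4)[1], 1.0

delta = td_error(r, avg_r, x_s, x_next, w)
avg_r += eta * delta       # update the average-reward estimate
w += alpha * delta * x_s   # semi-gradient TD(0) weight update
print(delta, avg_r, w)
```

The key intuition the sketch reflects: in the average-reward setting there is no discounting, so instead of comparing a reward to zero, every reward is compared to the long-run average $\bar{R}$, and the value function tracks *differential* returns.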

### Why do we average gradients and not loss in distributed training?

I’m running some distributed trainings in TensorFlow with Horovod. It runs training separately on multiple workers, each of which uses the same weights and does a forward pass on unique data. The computed gradients are averaged within the communicator (worker group) before being applied in the weight updates. I’m wondering – why not average the loss function across…
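One way to see why the two are mathematically interchangeable: the gradient operator is linear, so the gradient of the averaged loss equals the average of the per-worker gradients. A small numpy check on a toy squared-error loss (all values hypothetical, not from any Horovod run):

```python
import numpy as np

# Toy setup: loss_i(w) = (w . x_i - y_i)^2 on two "workers".
def grad(w, x, y):
    # d/dw (w.x - y)^2 = 2 * (w.x - y) * x
    return 2 * (w @ x - y) * x

w = np.array([1.0, -2.0])
x1, y1 = np.array([1.0, 0.5]), 1.0
x2, y2 = np.array([-0.3, 2.0]), 0.0

# Average of per-worker gradients...
g_avg = 0.5 * (grad(w, x1, y1) + grad(w, x2, y2))

# ...equals the gradient of the averaged loss (checked numerically
# via central differences).
def avg_loss(w):
    return 0.5 * ((w @ x1 - y1) ** 2 + (w @ x2 - y2) ** 2)

eps = 1e-6
g_num = np.array([(avg_loss(w + eps * e) - avg_loss(w - eps * e)) / (2 * eps)
                  for e in np.eye(2)])
print(np.allclose(g_avg, g_num, atol=1e-5))  # True
```

So the difference between the two schemes is practical, not mathematical: each worker must run its own backward pass locally anyway, and exchanging gradients is what lets every worker apply an identical update.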