What makes GAN or VAE better at image generation than NN that directly maps inputs to images

Say a simple neural network’s input is a collection of tags (encoded in some way), and the output is an image that corresponds to those tags. Say this network consists of some dense layers and some reverse (transpose) convolution layers. What is the disadvantage of this network, that directs people to invent fairly complicated things…