Vanishing gradient problem in deep learning
Sigiloso
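When a gradient must flow back through many layers, each applying its own transformation, the chain rule turns it into a product of many Jacobians, one per layer. If the eigenvalues (more precisely, the singular values) of those Jacobians are mostly below 1, that long product of fractions shrinks roughly exponentially with depth, so the early layers receive tiny gradients and training slows down or stalls. Common fixes: skip connections, which add an identity path so the gradient does not have to pass through every transformation; weight initialization that keeps the Jacobian eigenvalues roughly centered around 1; and non-saturating activations such as ReLU in place of sigmoid, whose derivative never exceeds 0.25. Tanh is better than sigmoid (its derivative peaks at 1) but still saturates for large inputs.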
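A toy sketch of the product-of-Jacobians argument, assuming a chain of scalar layers so that each "Jacobian" is a single number. The function `input_gradient`, the depth of 20, the input 0.5, and the weight 1.0 are all illustrative choices, not anything prescribed above. It multiplies the per-layer derivatives for sigmoid, tanh, ReLU, and sigmoid with a skip connection.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never exceeds 0.25


def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2  # peaks at 1 when z == 0


def relu(z):
    return max(0.0, z)


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def input_gradient(act, act_grad, depth, x=0.5, weight=1.0, skip=False):
    """Return d(output)/d(input) for a chain of `depth` scalar layers.

    Hypothetical helper for illustration: each layer computes
    h <- act(weight * h), or h <- h + act(weight * h) when skip=True.
    """
    h, grad = x, 1.0
    for _ in range(depth):
        z = weight * h
        local = weight * act_grad(z)  # this layer's scalar "Jacobian"
        if skip:
            local += 1.0  # the identity path contributes a derivative of 1
            h = h + act(z)
        else:
            h = act(z)
        grad *= local  # chain rule: multiply the per-layer factors
    return grad


DEPTH = 20  # assumed depth, chosen only to make the effect visible
grad_sigmoid = input_gradient(sigmoid, sigmoid_grad, DEPTH)
grad_tanh = input_gradient(math.tanh, tanh_grad, DEPTH)
grad_relu = input_gradient(relu, relu_grad, DEPTH)
grad_sigmoid_skip = input_gradient(sigmoid, sigmoid_grad, DEPTH, skip=True)

for name, g in [
    ("sigmoid", grad_sigmoid),
    ("tanh", grad_tanh),
    ("relu", grad_relu),
    ("sigmoid + skip", grad_sigmoid_skip),
]:
    print(f"{name:>15}: {g:.3e}")
```

In this toy setup the sigmoid chain's gradient collapses toward zero, tanh shrinks more slowly, ReLU passes the gradient through unchanged for a positive input, and the skip connection keeps every factor at 1 or above. Factors above 1 can also make the product grow, which is one reason the initialization point above still matters.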