1. Which is more difficult to optimize: a nonconvex function or a convex one?
2. Which activation function is less prone to the vanishing gradient problem: ReLU or sigmoid?
3. Does the vanishing gradient problem mainly affect layers closer to the beginning or the end of the neural network?
4. A medium-level backtracking question.
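
For questions 2 and 3, a rough numerical sketch may help with the reasoning. It assumes a hypothetical chain of 10 scalar units with every weight fixed to 1 and the same pre-activation value at each layer. These are simplifications chosen for illustration, not something the questions specify. The helper names (`sigmoid_grad`, `relu_grad`, `backprop_factors`) are made up for this example. The code multiplies activation derivatives backward from the output, which is how backpropagation scales the gradient that reaches each layer.

```python
import math


def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid at z; never larger than 0.25."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z: float) -> float:
    """Derivative of ReLU at z: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def backprop_factors(grad_fn, pre_activations):
    """Gradient scale reaching each layer, listed first layer first.

    Assumption: a chain of scalar units with all weights equal to 1,
    so the only scaling comes from the activation derivatives.
    """
    factors = []
    running = 1.0
    # Walk from the output layer back toward the input layer.
    for z in reversed(pre_activations):
        running *= grad_fn(z)
        factors.append(running)
    factors.reverse()
    return factors


# Assumed setup: 10 layers, each with pre-activation 0.5.
pre_activations = [0.5] * 10
sigmoid_factors = backprop_factors(sigmoid_grad, pre_activations)
relu_factors = backprop_factors(relu_grad, pre_activations)

print("sigmoid, first layer:", sigmoid_factors[0])
print("sigmoid, last layer: ", sigmoid_factors[-1])
print("ReLU, first layer:   ", relu_factors[0])
```

Under these assumptions the sigmoid factor shrinks at every step backward, so the layers nearest the input receive the smallest gradient, while the ReLU factor stays at 1 for positive pre-activations. A real network also multiplies by weight matrices, so this only isolates the contribution of the activation function.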