Another Technical Report on stochastic gradient descent, of little interest to non-AI-nerds. Follow-up to this. I tried Adam, but it wasn't doing nearly as well as RMSprop. After some investigation, it turned out RMSprop was relying heavily on its 10^-6 epsilon, which, being inside the sqrt, was acting more like a 10^-3 damping term. After setting Adam's epsilon to 10^-3, it did about as well as RMSprop.
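A minimal sketch of the effect described above (not the author's actual code; function names are mine): when the squared-gradient average is near zero, an epsilon placed *inside* the sqrt damps the update as sqrt(eps), so 10^-6 inside behaves like a 10^-3 term outside — which is why matching Adam's (outside-the-sqrt) epsilon to 10^-3 closes the gap.

```python
import math

def denom_inside(v, eps=1e-6):
    # RMSprop-style denominator with epsilon inside the sqrt
    return math.sqrt(v + eps)

def denom_outside(v, eps=1e-6):
    # Adam-style denominator with epsilon outside the sqrt
    return math.sqrt(v) + eps

v = 0.0  # tiny squared-gradient average: the epsilon dominates
print(denom_inside(v))             # sqrt(1e-6) = 1e-3 of effective damping
print(denom_outside(v))            # only 1e-6 of damping
print(denom_outside(v, eps=1e-3))  # raising Adam's eps to 1e-3 matches RMSprop
```

The practical takeaway: the two conventions are not interchangeable, and a fair comparison between optimizers has to account for where each implementation places its epsilon.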