Q. By adding momentum, the Momentum Optimizer effectively increases the learning rate, so won't it overshoot and escape the global minimum?
The momentum accumulates. It keeps increasing until the iterate has crossed the minimum. The moment it crosses the minimum, the momentum starts decreasing because the sign of the slope has changed.
So it would overshoot the minimum, but then it would come back, just like a rolling ball. If we release a ball in a bowl, the ball gains momentum and does not stop at the lowest point; instead it travels some distance up the other side. It keeps oscillating around the minimum before settling.
The same happens in the case of the Momentum Optimizer.
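To make the rolling-ball picture concrete, here is a minimal sketch of gradient descent with classical momentum on the bowl f(x) = x**2. The function name `momentum_descent` and the values of `lr`, `beta`, `steps` and the starting point are my own assumptions for illustration, not something from the question.

```python
def momentum_descent(grad, x0, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with classical (heavy-ball) momentum in 1-D.

    lr, beta and steps are illustrative values, not tuned settings.
    """
    x, v = x0, 0.0
    path = [x]
    for _ in range(steps):
        v = beta * v - lr * grad(x)  # velocity accumulates past gradients
        x = x + v                    # move by the accumulated velocity
        path.append(x)
    return path


# Bowl-shaped loss f(x) = x**2: gradient is 2*x, minimum is at x = 0.
path = momentum_descent(lambda x: 2.0 * x, x0=5.0)

# Count how often consecutive iterates sit on opposite sides of the minimum.
crossings = sum(1 for a, b in zip(path, path[1:]) if a * b < 0)

print("times the iterate crossed the minimum:", crossings)
print("furthest overshoot past the minimum:", min(path))
print("final position:", path[-1])
```

With these settings the iterate starts at x = 5, goes past x = 0 to the negative side, comes back, and keeps oscillating with a shrinking amplitude until it settles near the minimum.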
Very good question, Soumyadeep!
Momentum (p) is directly proportional to velocity (v): p = m × v.
Velocity is the rate of change of displacement (x): v = delta(x)/delta(t), which is also the slope when you plot x against t.
So at every iteration the momentum is calculated from the change in displacement (x).
Assumptions:-
Let us say it takes 2 steps, d1 and d2.
Let the initial momentum be 0, as the iterations have not started.
Initial velocity v1 = 0, so p1 = 0 (a body at rest has zero velocity and zero momentum).
Let m (mass) = 1.
Iteration time = 1 sec.
Step1:-
p1 = v1 = 0
mass = 1
time = 1 sec
Let the iterations start; the body is now moving with some velocity and has covered some distance d1.
- Calculate the current velocity at d1.
v2 (current velocity) = d(d1)/dt, i.e. the gradient (rate of change) at d1
Step2:-
- Calculate the future distance that will be covered, based on the known v2 and v1 at d1.
d2 = (v2 - v1) × t (v1 from step 1) (approximate next position).
- Calculate the present momentum at d1.
p2 (current momentum) = (d2 - d1)/t (momentum based on the approximate next position d2).
- Calculate the velocity at d2.
v3 (current velocity) = d(d2)/dt, i.e. the gradient at d2
Step3:-
So our momentum was good and now we are at d2.
Repeat step 2 with d2 and v3:
1) d3 = (v3 - v2) × t (approximate next position)
2) p3 (current momentum) = (d3 - d2)/t, taking the iteration time as 1 sec.
3) v4 (current velocity) = d(d3)/dt
Observations:-
The above algorithm finds the current momentum based on the approximate next position, and calculates the gradient d(d2)/dt with respect to that approximated next position.
This prevents us from going too fast, because the update depends on both the future distance and the present velocity.
If the predicted distance (the path towards the local or global minimum) is small, our present velocity is decreased.
So the larger the distance, the larger the velocity, and the smaller the distance, the smaller the velocity (that is, the effective step size).
This is how it reaches the local or global minimum. (There is also the problem of getting stuck in a local minimum.)
So we use stochastic gradient descent (SGD) + momentum, which can help you get out of shallow local minima and move towards the global minimum, although reaching it is not guaranteed.
Nesterov Momentum + SGD tends to give a good response in this case.
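The idea of taking the gradient at the approximate next position can be sketched as below, next to plain momentum for comparison. This is only a toy sketch under my own assumptions: the helper `run`, the loss f(x) = x**2 and the values of `lr`, `beta`, `steps` are made up for illustration, and it uses the full gradient rather than stochastic mini-batches.

```python
def run(grad, x0, nesterov, lr=0.1, beta=0.9, steps=300):
    """1-D momentum descent; nesterov=True uses the look-ahead gradient.

    All hyperparameters here are illustrative, not recommended values.
    """
    x, v = x0, 0.0
    path = [x]
    for _ in range(steps):
        # Nesterov: evaluate the gradient at the approximate next position.
        # Classical momentum: evaluate it at the current position.
        point = x + beta * v if nesterov else x
        v = beta * v - lr * grad(point)
        x = x + v
        path.append(x)
    return path


def grad(x):
    # Gradient of the bowl f(x) = x**2, whose minimum is at x = 0.
    return 2.0 * x


classical = run(grad, x0=5.0, nesterov=False)
nesterov = run(grad, x0=5.0, nesterov=True)

print("classical: furthest overshoot", min(classical), "final", classical[-1])
print("nesterov:  furthest overshoot", min(nesterov), "final", nesterov[-1])
```

On this particular bowl both variants settle at the minimum, and the look-ahead version goes less far past it before turning back. That matches the "prevents us from going too fast" observation above, but it is one toy example and not a general guarantee.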
Note:-
This is completely based on my understanding of Maths, Physics and Momentum concept.
All the best!
Thank you very much for the nice explanation.