Wednesday, October 19, 2011

Unexplained jumps in Seconds_Behind_Master

I am trying to understand why a server would go from 0 to 45 and then back to 0 seconds of replication lag as reported by the Seconds_Behind_Master column in SHOW SLAVE STATUS output. This occurs over a few seconds so there isn't a statement that runs for 45 seconds on the slave. I then compared consecutive SET TIMESTAMP values in the binlog and the absolute value of the differences is at most 2 seconds.

Has anyone else been confused by this? I filed bug 62839 and think there is a race condition in the code that computes the value for Seconds_Behind_Master. If I am correct about this then we need to fix the problem. Replication lag is a big problem for many of us and reporting a value that is much larger than the actual value is bad PR.

12 comments:

  1. This reminded me about another bug, not sure whether there is a bug report for it, haven't searched.

    Sometimes reported Exec_Master_Log_Pos is invalid. It jumps to some low value (0-200) and on the next `SHOW SLAVE STATUS` returns back to the correct position.

    The pattern of the jump is very similar to the described in your case, though in my case it jumps back on the next `SHOW SLAVE STATUS` run and since it is done programmatically, it takes less then a second.

    ReplyDelete
  2. Yes. Some years ago I devised a benchmark to measure max replication throughput. The idea was that I would connect to the slave and read Seconds_Behind_Master, and if it was above some threshold (like 5 sec) then I would throttle the benchmark.

    I was surprised to see Seconds_Behind_Master would frequently show very high values, like millions of seconds, then dropping back to 0. I don't think I ever saw a reasonable wrong value like 45 seconds, but of course that would have easily been missed by my setup.

    My fix was just to ignore absurdly large values. I may have filed a bug, or at least I sent an internal email, but the response was just something like it's been seen before and nobody had an idea for a fix.

    ReplyDelete
  3. Have you checked if your master and slave have the exact same system time ?
    Since Seconds_Behind_Master is the difference of TIMESTAMP on the slave and the value of TIMESTAMP in the binary log, it could be possible that you have a slight time desynchronisation between your two servers.

    ReplyDelete
  4. Could it be http://www.pythian.com/news/853/when-show-slave-status-lies/ ?

    ReplyDelete
  5. Maybe similar. Might be related to slave disconnect/reconnect to master after an error, but the bugs referenced in that are fixed in the 5.1 versions we use.

    ReplyDelete
  6. I think this is another instance of confusion when Seconds_Behind_Master is reset to zero when the SQL thread catches up to the IO thread. This has confused me a couple of times as official MySQL has not much monitoring to let you know that is happening other than SBM flipping between zero and non-zero.

    ReplyDelete
  7. It's zero when the SQL thread isn't executing anything. That can be when it's caught up to IO, or just between statements when it's racing through a log of lots of small changes. Isn't this the whole reason we switched to heartbeat?

    ReplyDelete
  8. The problem that has confused me is when there is real lag (10+ seconds) but the IO thread is behind the master and the SQL thread is able to keep up with the IO thread. In that case SBM will flip between 0 and the real lag value. Heartbeat doesn't help in this case. Our workload has frequent transactions.

    I don't remember all of the reasons heartbeat was used at G.

    ReplyDelete
  9. I've seen the same thing, but with even larger numbers. Just now, I saw SBM go from 0 to 20800 to 0 within 2 seconds. I see it regularly enough on multiple slaves that I've just had to start ignoring SBM unless it's consistent for more than a few seconds.

    Would be nice to see it fixed though...

    ReplyDelete
  10. I've seen the same thing. It can also be NTP messed up between Master and Slave, then NTP corrects it. Also I've seen 0 to 100K+ seconds. I thought that was due to some divide by 0 issue.

    ReplyDelete
  11. I've seem weird SBM values when the binlogs and/or relay logs were rotated.

    ReplyDelete
  12. I've seen this before when I have two slaves with the same server_id. it coincides with relay logs rotating like crazy.

    ReplyDelete

 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.