    BUG#22628341: GTID HOLES WITH MULTIPLE WRITERS AND REDUCED PERFORMANCE · d90dcae9
    Nuno Carvalho authored
    Scenario
    ========
    Group Replication with two members, both committing conflicting
    transactions. To be more exact, every transaction conflicts, which
    is the worst-case scenario. Example:
        M1         M2
      T1: W(A)   T2: W(A)
      T3: W(B)   T4: W(B)
    
    On each round, one of the two transactions is rolled back, and the
    clients move on to the next round. There is no synchronization
    between servers.
    
    Analysis
    ========
    After running dozens of rounds, the value of GTID_EXECUTED on each
    member is:
    
     M1:
       UUID:1-56:58:61-62:64:66:69-70:72:74:76:78:80:82:84:86:88:90
     M2:
       UUID:1-55:57:59-60:63:65:67-68:71:73:75:77:79:81:83:85:87:89:91
    
    The question is: what happened to GNO 55 and/or 56 that caused
    GTID_EXECUTED to diverge between the two members from that point on?
    Also, the error logs on both servers show "Duplicate entry '53' for
    key 'PRIMARY'" errors. The appliers errored out on both members.
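
    The divergence can be checked mechanically. The following is a
    minimal sketch, not server code: parse_gnos is a hypothetical
    helper that expands only the interval part of a GTID set (the text
    after "UUID:") into a Python set, so that the two values above can
    be compared.

```python
def parse_gnos(intervals):
    """Expand an interval list such as '1-3:5' into {1, 2, 3, 5}."""
    gnos = set()
    for part in intervals.split(":"):
        first, _, last = part.partition("-")
        # A single number like '58' has no '-', so last is empty.
        gnos.update(range(int(first), int(last or first) + 1))
    return gnos


# Interval parts of GTID_EXECUTED as reported on each member.
m1 = parse_gnos("1-56:58:61-62:64:66:69-70:72:74:76:78:80:82:84:86:88:90")
m2 = parse_gnos("1-55:57:59-60:63:65:67-68:71:73:75:77:79:81:83:85:87:89:91")

common = m1 & m2                 # GNOs applied on both members
first_divergence = min(m1 ^ m2)  # lowest GNO present on only one member
```

    With these values, the members agree on GNOs 1 to 55 only, and
    every GNO from 56 to 91 is present on exactly one of them.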
    
    Looking at the certification information, we see:
    GNO: 55; write_set: '3614141148'; snapshot_version: 'UUID:1-54'
    GNO: 56; write_set: '2045756991'; snapshot_version: 'UUID:1-54'; source: M1
    GNO: 57; write_set: '2045756991'; snapshot_version: 'UUID:1-55'; source: M2
    
    We can see that the same write-set, 2045756991, is certified with
    two different snapshot_version values (the snapshot_version is the
    GTID_EXECUTED at the moment the transaction was instructed to
    commit). GNO 56 belongs to M1 and GNO 57 belongs to M2; both
    insert the value 53.
    
    In this scenario, M1 is, for natural reasons, not able to commit
    anything: all its transactions are rolled back, until it finally
    manages to send its broadcast to the group before M2 and get the
    transaction with GNO 56 through. Up to this moment it has received
    55 transactions, but it is still applying one of them, the one with
    GNO 55. So when it broadcasts the transaction that will get GNO 56,
    its GTID_EXECUTED is UUID:1-54; GNO 55 is not yet applied.
    Since this transaction does not conflict with any other on that
    snapshot_version, it is accepted to commit.
    
    Meanwhile, on M2, the same data is broadcast for certification, but
    since this server has so far only had local transactions, its
    GTID_EXECUTED is UUID:1-55. So this transaction is also accepted
    to commit, since its snapshot_version supersedes UUID:1-54. This is
    easier to see in a table:
      M1: T1: ws=53; sv=UUID:1-54
      M2: T2: ws=53; sv=UUID:1-55
    Since the snapshot_version of T2 is bigger than the one of T1,
    this is not detected as a conflict.
    
    Legend:
      ws: write-set
      sv: snapshot_version
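
    The faulty decision can be reproduced with a small model. This is
    a sketch under stated assumptions, not the plugin's certifier:
    certify, gno_range and certification_info are hypothetical names,
    snapshot versions are modelled as plain sets of GNOs, and "bigger"
    is modelled as "strict superset".

```python
def gno_range(first, last):
    """Model a snapshot version UUID:first-last as a frozenset of GNOs."""
    return frozenset(range(first, last + 1))


def certify(certification_info, write_set, snapshot_version):
    """Pre-fix rule: store the incoming snapshot version unchanged."""
    certified = certification_info.get(write_set)
    # Conflict unless the incoming snapshot strictly contains the
    # snapshot version already certified for this write-set.
    if certified is not None and not certified < snapshot_version:
        return False
    certification_info[write_set] = snapshot_version
    return True


certification_info = {}
# Same write-set hash, broadcast by M1 and then by M2.
t1_accepted = certify(certification_info, "2045756991", gno_range(1, 54))
t2_accepted = certify(certification_info, "2045756991", gno_range(1, 55))
```

    Both calls return True in this model: UUID:1-55 strictly contains
    UUID:1-54, so the second insert of value 53 is not flagged.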
    
    There is, though, indeed a problem here: we now have two
    transactions with the same data, and since the applier is slower
    than local transactions, the applier threads will error out when
    they apply them.
    
    Solution
    ========
    When a transaction is accepted to commit (no conflicts found), we
    need to include its own global identifier in its snapshot version,
    so that all transactions that touch the same data but have not yet
    seen the current transaction are rolled back.
    The correct conflict detection trace for this scenario is:
      M1: T1: ws=53; sv_in=UUID:1-54, sv=UUID:1-55 <- accepted to commit
      M2: T2: ws=53; sv_in=UUID:1-55 <- conflicts with T1, rollback
    Since the snapshot version of T2 is not bigger than the one of T1,
    the conflict is correctly detected and T2 is rolled back.
    
    Legend:
      ws:    write-set
      sv_in: snapshot_version sent together with the transaction
      sv:    snapshot_version persisted after the transaction is
             accepted to commit.
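
    The corrected rule can be sketched as follows. This is an
    illustrative model only, not the plugin's implementation: certify
    and its parameters are hypothetical names, snapshot versions are
    plain sets of GNOs, and "bigger" is modelled as "strict superset".
    The only point it illustrates is that the persisted snapshot
    version includes the GNO of the accepted transaction itself.

```python
def certify(certification_info, write_set, sv_in, gno):
    """Return True if accepted; persist sv = sv_in plus the own GNO."""
    certified = certification_info.get(write_set)
    # Conflict unless sv_in strictly contains the persisted snapshot
    # version of the last transaction certified for this write-set.
    if certified is not None and not certified < sv_in:
        return False
    certification_info[write_set] = sv_in | {gno}
    return True


# Trace from the text: T1 arrives with sv_in=1-54 and is persisted
# with sv=1-55; T2 arrives with sv_in=1-55.
info = {}
t1_accepted = certify(info, "53", frozenset(range(1, 55)), gno=55)
t2_accepted = certify(info, "53", frozenset(range(1, 56)), gno=56)

# Same check with the GNOs from the analysis: T1 is assigned GNO 56,
# so its persisted sv is 1-54:56, which T2's sv_in=1-55 has not seen.
info_analysis = {}
t1_analysis = certify(info_analysis, "53", frozenset(range(1, 55)), gno=56)
t2_analysis = certify(info_analysis, "53", frozenset(range(1, 56)), gno=57)
```

    In both runs T1 is accepted and T2 is rejected, because T2's
    sv_in does not contain T1's own GNO.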
    
    Changes
    =======
    To implement the solution for this bug, the following changes were
    required:
    
    Server side
    -----------
     1. Add functionality to copy a Sid_map, so that operations on top
        of the snapshot version do not all require global_sid_lock to
        be acquired.
        Files: log_event.cc, log_event.h, rpl_gtid.h, rpl_gtid_sid_map.cc
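
    The idea behind copying the map can be illustrated with a toy
    model. Everything below is an assumption made for illustration:
    the SidMap class, its methods and the plain mutex are simplified
    stand-ins and do not reproduce the server's Sid_map or
    global_sid_lock. The sketch only shows the pattern of taking the
    lock once to copy and then working on the private copy.

```python
import threading


class SidMap:
    """Toy stand-in: maps a UUID string to a small integer, from 1."""

    def __init__(self):
        self._sid_to_sidno = {}

    def add_sid(self, sid):
        # Assign the next integer the first time a UUID is seen.
        if sid not in self._sid_to_sidno:
            self._sid_to_sidno[sid] = len(self._sid_to_sidno) + 1
        return self._sid_to_sidno[sid]

    def sid_to_sidno(self, sid):
        # 0 means "unknown UUID" in this toy model.
        return self._sid_to_sidno.get(sid, 0)

    def copy(self):
        clone = SidMap()
        clone._sid_to_sidno = dict(self._sid_to_sidno)
        return clone


global_sid_lock = threading.Lock()  # simplification: a plain mutex
global_sid_map = SidMap()

# Take the lock once, only for the duration of the copy.
with global_sid_lock:
    global_sid_map.add_sid("uuid-a")
    snapshot_sid_map = global_sid_map.copy()

# Later lookups on the private copy need no lock and are unaffected
# by changes made to the shared map afterwards.
with global_sid_lock:
    global_sid_map.add_sid("uuid-b")
sidno_a = snapshot_sid_map.sid_to_sidno("uuid-a")
sidno_b = snapshot_sid_map.sid_to_sidno("uuid-b")
```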
    
    Validation and testing are done in BUG 21616303, in the Group
    Replication plugin.