    28f2a939
    Bug #27566346 NDB : BACKUP WITH SNAPSHOTSTART CONSISTENCY ISSUES · 28f2a939
    Priyanka Sangam authored
    
    
    Reviewed-by: Frazer Clement <frazer.clement@oracle.com>
    
    A backup needs to be restored to a consistent point, for which it
    uses a fuzzy scan and a log.
    
    The fuzzy scan is restored, and then the log is replayed
    idempotently up to some consistent point which is after
    (SNAPSHOTEND) or before (SNAPSHOTSTART) any of the states captured
    in the scan.
    
    This requires that the backup is captured in this order:
    1) Start recording logs of all committed transactions
    2) Choose SNAPSHOTSTART consistent point, save as StartGCP
    3) Perform data scan
    4) Choose SNAPSHOTEND consistent point, save as StopGCP
    5) Stop recording logs
    
    Choosing the consistent point is done by sending WAIT_GCP_REQ to
    DIH which blocks until the open GCI at the time it was received
    has been completed (subject to being fixed in Bug 27497461),
    then returns that GCI number. So the consistent point is the *end*
    of the WAIT_GCP_REQ GCI.
    
    For SNAPSHOTEND backups, we need to ensure that the log contains
    all changes up to and beyond the last changes recorded by the scan
    so that redo takes us to a consistent point. This means that we
    will apply all of the changes *up to and including* the
    StopGCP to recover to a consistent point. We must include
    the StopGCP, to ensure that anything committed and included at
    the end of the scan is also logged.
    
    For SNAPSHOTSTART backups, we need to ensure that the log contains all
    changes going back to before the first changes recorded by the scan
    so that undo takes us to a consistent point. This means that we will
    apply all of the changes down to *but not including* the WAIT_GCP_REQ
    GCI.
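
The two boundary rules above can be written as a pair of predicates. This is
a minimal sketch, not ndb_restore code: the function names and the use of
plain integers for GCPs are assumptions made for illustration.

```python
def redo_applies(entry_gcp: int, stop_gcp: int) -> bool:
    """SNAPSHOTEND: redo changes up to and including StopGCP."""
    return entry_gcp <= stop_gcp


def undo_applies(entry_gcp: int, start_gcp: int) -> bool:
    """SNAPSHOTSTART: undo changes down to, but not including, StartGCP."""
    return entry_gcp > start_gcp
```

With StartGCP = 10, an undo entry from GCP 11 is applied and one from GCP 10
is not; with StopGCP = 10, a redo entry from GCP 10 is applied and one from
GCP 11 is not.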
    
    1) It is not guaranteed that every log entry contains a GCP. If
    there is a block of entries which contain the same GCP, the GCP is
    logged for the first entry and left blank for the remaining entries.
    This is problematic for undo logging since the file is read from end
    to beginning.
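
To see why blank GCPs are harmless on a forward read but not on a backward
read, consider the sketch below. It models the log as a list in which None
stands for a blank GCP. Both readers are hypothetical and only illustrate the
ordering problem; they are not taken from the backup or restore code.

```python
from typing import List, Optional


def resolve_forward(gcps: List[Optional[int]]) -> List[Optional[int]]:
    """Forward read: a blank entry inherits the most recent explicit GCP."""
    resolved, current = [], None
    for gcp in gcps:
        if gcp is not None:
            current = gcp
        resolved.append(current)
    return resolved


def resolve_backward_naive(gcps: List[Optional[int]]) -> List[Optional[int]]:
    """Backward read that reuses the last GCP seen (hypothetical naive reader).

    The explicit GCP of a block sits at the block's first entry, which a
    backward reader reaches last, so blank entries pick up the wrong value.
    """
    resolved, current = [], None
    for gcp in reversed(gcps):
        if gcp is not None:
            current = gcp
        resolved.append(current)
    resolved.reverse()  # report results in file order
    return resolved


log = [7, None, None, 8, None]
print(resolve_forward(log))         # [7, 7, 7, 8, 8]
print(resolve_backward_naive(log))  # [7, 8, 8, 8, None]
```

The backward result attributes the two blank entries of the GCP 7 block to
GCP 8 and cannot attribute the final entry at all.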
    
    This is fixed by modifying backup to log the GCP for every undo log
    entry. This does not break the format since it is still within the
    spec. GCP inclusion is critical for the correct behaviour in undo
    log replay, so there will be a behavioural difference while using
    ndb_restore on 'before patch' and 'after patch' backups.
    
    In the worst case, where all the log entries have the same GCI, this
    change adds an overhead of 4 bytes for every log entry except the last.
    A log entry is 12 bytes long before data and keys are included.
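
The worst-case figure above works out as follows. This is a small arithmetic
sketch; the constant and function names are illustrative, and the entry
counts are made-up inputs rather than measurements.

```python
GCP_FIELD_BYTES = 4  # extra bytes per entry that now carries a GCP


def worst_case_overhead(num_entries: int) -> int:
    """Extra bytes when every entry has the same GCI.

    Before the change only one entry in the block carried the GCP, so all
    but one of the entries grow by GCP_FIELD_BYTES.
    """
    if num_entries < 1:
        return 0
    return GCP_FIELD_BYTES * (num_entries - 1)


print(worst_case_overhead(1))          # 0
print(worst_case_overhead(1_000_000))  # 3999996
```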
    
    The format of redo log entries is not changed, so there is no change
    to SNAPSHOTEND backup or restore.
    
    2) Undo logging begins before startGCP, so the undo log may contain
    entries with GCP <= startGCP. Also, log entries are not ordered by GCP,
    so there may be entries from different GCPs interleaved in the log file.
    
    This is fixed by modifying ndb_restore undo log replay to replay log
    entries only if their GCPs > startGCP. So the undo logfile is still
    read from EOF to file-start, but all the log entries with GCPs <=
    startGCP are skipped. Undo log replay does not terminate when a log
    entry with GCP <= startGCP is found, since there may still be entries
    with GCP > startGCP beyond the current entry.
    
    No change to SNAPSHOTEND backup or restore.
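
The replay rule for SNAPSHOTSTART can be sketched as a backward loop that
skips, rather than stops at, entries with GCP <= startGCP. The UndoEntry type,
the replay_undo function and the apply callback are hypothetical names used
only for this illustration; the real ndb_restore code is C++ and structured
differently.

```python
from typing import Callable, NamedTuple, Sequence


class UndoEntry(NamedTuple):
    gcp: int
    payload: str


def replay_undo(log: Sequence[UndoEntry], start_gcp: int,
                apply: Callable[[UndoEntry], None]) -> int:
    """Replay an undo log from EOF to file start; return entries applied."""
    applied = 0
    for entry in reversed(log):
        if entry.gcp <= start_gcp:
            # Skip, but keep reading: entries are not ordered by GCP, so
            # later iterations may still find entries with GCP > start_gcp.
            continue
        apply(entry)
        applied += 1
    return applied


log = [UndoEntry(9, "a"), UndoEntry(11, "b"),
       UndoEntry(10, "c"), UndoEntry(12, "d")]
undone = []
replay_undo(log, start_gcp=10, apply=lambda e: undone.append(e.payload))
print(undone)  # ['d', 'b']
```

Stopping at entry "c" (GCP 10) would have missed entry "b" (GCP 11), which
lies closer to the start of the file.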
    
    3) The consistent GCP should be restored into ndb_apply_status.
    ndb_restore always restores the stopGCP as the consistent GCP, which
    is incorrect for SNAPSHOTSTART backups.
    
    This is fixed by modifying ndb_restore to restore the startGCP to
    ndb_apply_status for SNAPSHOTSTART backups.
    
    No change to SNAPSHOTEND backup or restore.
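
The choice of GCP written to ndb_apply_status reduces to a single branch on
the backup type. This is a sketch under the assumption that the backup type
is available as a boolean; the function and parameter names are illustrative
and are not ndb_restore identifiers.

```python
def consistent_gcp(is_snapshot_start: bool, start_gcp: int, stop_gcp: int) -> int:
    """Pick the GCP that describes the restored, consistent state."""
    # SNAPSHOTSTART restores to the state at backup start, so startGCP is
    # the consistent point; SNAPSHOTEND restores to the state at backup end.
    return start_gcp if is_snapshot_start else stop_gcp
```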
    
    The existing behaviour will be retained in these 2 cases:
    
    a) If a patched ndb_restore is used on an 'old' SNAPSHOTSTART backup
    from a non-patched data node: the undo log is replayed to the start of
    the log, so log entries prior to startGCP may be replayed.
    
    b) If an unpatched ndb_restore is used on a fixed SNAPSHOTSTART backup
    from a patched data node: undo log entries with GCP <= startGCP will be
    skipped, but ndb_restore may assign an incorrect GCP to some log
    entries, possibly resulting in replay of entries with GCP <= startGCP.
    
    ConsistencyUnderLoad testcases are added for SNAPSHOTSTART backups.
    The GCP stall testcase stalls the GCP at the beginning of the backup
    to ensure that the backup log includes log entries from startGCP-1.
    The test then checks that these entries have not been replayed.
    
    These tests replace a less comprehensive mtr test, which is
    removed.