mysql-test/suite/ndb/t/ndb_restore_misc.test · ba12387e40a6cf2adb3746e9dc5271f1772ce7ab · Rasoul Jahanshahi / Mysql Server

Mar 26, 2019

Bug #27566346 NDB : BACKUP WITH SNAPSHOTSTART CONSISTENCY ISSUES · 28f2a939

Priyanka Sangam authored Mar 26, 2019



Reviewed-by: Frazer Clement <frazer.clement@oracle.com>

A backup needs to be restored to a consistent point, for which it
uses a fuzzy scan and a log.

The fuzzy scan is restored, and then the log is replayed
idempotently up to some consistent point which is after
(SNAPSHOTEND) or before (SNAPSHOTSTART) any of the states captured
in the scan.

This requires that the backup is captured in the following order:
1) Start recording logs of all committed transactions
2) Choose SNAPSHOTSTART consistent point, save as StartGCP
3) Perform data scan
4) Choose SNAPSHOTEND consistent point, save as StopGCP
5) Stop recording logs

Choosing the consistent point is done by sending WAIT_GCP_REQ to
DIH, which blocks until the GCI that was open when the request was
received has completed (subject to being fixed in Bug 27497461),
and then returns that GCI number. So the consistent point is the
*end* of the WAIT_GCP_REQ GCI.

For SNAPSHOTEND backups, we need to ensure that the log contains
all changes up to and beyond the last changes recorded by the scan
so that redo takes us to a consistent point. This means that we
will apply all of the changes *up to and including* the
StopGCP to recover to a consistent point. We must include
the StopGCP, to ensure that anything committed and included at
the end of the scan is also logged.

For SNAPSHOTSTART backups, we need to ensure that the log contains all
changes going back to before the first changes recorded by the scan
so that undo takes us to a consistent point. This means that we will
apply all of the changes down to *but not including* the WAIT_GCP_REQ
GCI.
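
The two boundary rules above can be illustrated with a minimal sketch.
This is not code from ndb_restore; LogEntry, redo_entries and
undo_entries are hypothetical names, and the log is modelled as a
plain list in file order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    gci: int     # GCI in which the logged change committed
    change: str  # placeholder for the logged row change (assumption)


def redo_entries(log, stop_gcp):
    """SNAPSHOTEND: replay forward, up to and including stop_gcp."""
    return [e for e in log if e.gci <= stop_gcp]


def undo_entries(log, start_gcp):
    """SNAPSHOTSTART: undo newest first, down to but not including start_gcp."""
    return [e for e in reversed(log) if e.gci > start_gcp]
```

The only difference between the two filters is the inclusive bound for
redo (<= StopGCP) and the exclusive bound for undo (> StartGCP).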

1) It is not guaranteed that every log entry contains a GCP. If
there is a block of entries which contain the same GCP, the GCP is
logged for the first entry and left blank for the remaining entries.
This is problematic for undo logging since the file is read from end
to beginning.

This is fixed by modifying backup to log the GCP for every undo log
entry. This does not break the format since it is still within the
spec. GCP inclusion is critical for the correct behaviour in undo
log replay, so there will be a behavioural difference while using
ndb_restore on 'before patch' and 'after patch' backups.
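
A rough sketch of why the blank GCPs are a problem for a reader that
starts at the end of the file. The record layout here is an
assumption made for illustration (a GCP-or-None paired with an entry
index), not the real backup log format; encode and
gcps_seen_reading_backwards are hypothetical helpers.

```python
def encode(gcps, always_log_gcp):
    """Return records in file order as (gcp_or_None, entry_index).

    With always_log_gcp=False, the GCP is written only on the first
    entry of a run of equal GCPs, as described above.
    """
    records, last = [], None
    for i, gcp in enumerate(gcps):
        if always_log_gcp or gcp != last:
            records.append((gcp, i))
        else:
            records.append((None, i))
        last = gcp
    return records


def gcps_seen_reading_backwards(records):
    """GCP visible to a reverse reader at the moment it reaches each record."""
    return [gcp for gcp, _ in reversed(records)]
```

With the old encoding, the reverse reader meets the entries with a
blank GCP before it meets the entry that carries the value for that
run, so it cannot tell which GCP they belong to at the time it has to
decide whether to replay them.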

In the worst case, where all the log entries have the same GCI, this
change adds an overhead of 4 bytes for every log entry except the last.
A log entry is 12 bytes long before data and keys are included.
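
As a small worked example of the worst case stated above, using only
the two figures given here (4 bytes per added GCP, 12 bytes per entry
before data and keys); the function name is hypothetical.

```python
GCP_BYTES = 4            # size of one logged GCP, from the text above
ENTRY_HEADER_BYTES = 12  # entry size before data and keys, from the text above


def worst_case_overhead_bytes(num_entries):
    """Extra bytes when every entry shares one GCI: all entries but one grow."""
    return GCP_BYTES * max(num_entries - 1, 0)
```

For 1000 entries this gives 3996 extra bytes. Relative to the 12-byte
fixed part, the added 4 bytes are at most one third, and the ratio
shrinks further once data and keys are counted.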

The format of redo log entries is not changed, so there is no change
to SNAPSHOTEND backup or restore.

2) Undo logging begins before startGCP, so the undo log may contain
entries with GCP <= startGCP. Also, log entries are not ordered by GCP,
so there may be entries from different GCPs interleaved in the log file.

This is fixed by modifying ndb_restore undo log replay to replay log
entries only if their GCPs > startGCP. So the undo logfile is still
read from EOF to file-start, but all the log entries with GCPs <=
startGCP are skipped. Undo log replay does not terminate when a log
entry with GCP <= startGCP is found, since there may still be entries
with GCP > startGCP beyond the current entry.
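
The replay rule described above can be sketched as follows. This is a
simplified model, not the ndb_restore implementation: entries are
(gcp, change) tuples in file order, and replay_undo and apply are
hypothetical names.

```python
def replay_undo(entries, start_gcp, apply):
    """Read the undo log from end to start, applying entries with gcp > start_gcp.

    Entries with gcp <= start_gcp are skipped rather than treated as a
    stop condition, because GCPs are interleaved in the file.
    Returns the number of skipped entries.
    """
    skipped = 0
    for gcp, change in reversed(entries):
        if gcp <= start_gcp:
            skipped += 1
            continue  # skip this entry, but keep scanning towards file start
        apply(change)
    return skipped
```

Stopping at the first entry with GCP <= startGCP would miss any later
entries (earlier in the file) whose GCP is still greater than
startGCP, which is why the loop continues to the start of the file.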

No change to SNAPSHOTEND backup or restore.

3) The consistent GCP should be restored into ndb_apply_status.
ndb_restore always restores the stopGCP as the consistent GCP, which
is incorrect for SNAPSHOTSTART backups.

This is fixed by modifying ndb_restore to restore the startGCP to
ndb_apply_status for SNAPSHOTSTART backups.
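
In outline, the choice of which GCP to restore looks like this. The
function and parameter names are hypothetical and only illustrate the
rule; they are not taken from ndb_restore.

```python
def consistent_gcp(is_snapshotstart, start_gcp, stop_gcp):
    """GCP to restore into ndb_apply_status for a given backup type."""
    # SNAPSHOTSTART backups are consistent at startGCP,
    # SNAPSHOTEND backups at stopGCP.
    return start_gcp if is_snapshotstart else stop_gcp
```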

No change to SNAPSHOTEND backup or restore.

The existing behaviour will be retained in these 2 cases:

a) If a patched ndb_restore is used on an 'old' SNAPSHOTSTART backup
from a non-patched data node: the undo log is replayed to the start
of the log, so log entries prior to startGCP may be replayed.

b) If an unpatched ndb_restore is used on a fixed SNAPSHOTSTART backup
from a patched data node: undo log entries with GCP <= startGCP will
be skipped, but ndb_restore may assign an incorrect GCP to some log
entries, possibly resulting in replay of entries with GCP <= startGCP.

ConsistencyUnderLoad testcases are added for SNAPSHOTSTART backups.
The GCP stall testcase stalls the GCP at the beginning of the backup
to ensure that the backup log includes log entries from startGCP-1.
The test then checks that these entries have not been replayed.

These tests replace a less comprehensive mtr test, which is
removed.
