Ole John Aske authored
When another mysqld node is started and joins (subscribes to) the schema distribution protocol, another mysqld which is waiting for a schema change to be distributed will time out during that wait. That happens because we incorrectly assumed that the newly arriving mysqld node would also 'ack' the schema distribution. However, it arrived too late to be a participant in it.

This patch fixes 3 issues, all contributing to this failure:

a) There is a potential race between an 'inflight' subscribe event and the start of a schema distribution. The subscribing node might or might not take part in the schema distribution, and its role is actually unknown at the point in time where the schema operation is started by the coordinator. The set of participating servers can only be determined when the Coordinator acks its own schema op: if the subscribe event arrived before its own schema op, then the subscribing node is a participant. This patch modifies the Coordinator's ack to also modify the acked slock_bitmap, clearing the servers *not* participating.

b) check_wakeup_clients() called get_subscriber_bitmask() to get the current set of subscribers. However, 'self' was not included in the subscribers, which it always should be. Fixed this by letting Ndb_schema_dist_data::init() add 'own_nodeid' to subscribers. Furthermore, this enables us to clean up a couple of places where we used to add own_nodeid to the set retrieved from get_subscribers_bitmask().

c) handle_clear_slock() copied schema->slock into ndb_schema_object->slock_bitmap, thereby overwriting the intersect done as part of a). Changed the copy to do an intersect instead.

This patch also modifies several places where schema distribution progress is printed:

- Always print the more significant part of the bitmask before the less significant part.
- Add some formatting when printing the bitmasks.

Also removes a few clears of bitmasks immediately after an init, which are redundant as ::init() also clears the bitmask.
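The interaction between a) and c) can be illustrated with a minimal sketch. This is not the patch's code: it uses std::bitset in place of the server's own bitmap type, and the names NodeBitmap, kMaxNodes, coordinator_ack(), apply_clear_slock() and distribution_done() are hypothetical. It assumes a set bit means "this node has not acked yet" and that the distribution is complete once no bits remain set.

```cpp
#include <bitset>
#include <cstddef>

// Hypothetical upper bound on node ids, chosen for illustration only.
constexpr std::size_t kMaxNodes = 256;
using NodeBitmap = std::bitset<kMaxNodes>;

// (a) When the coordinator acks its own schema op, the participant set
// becomes known: keep only nodes subscribed at that point, then clear
// the coordinator's own bit since it has now acked.
// Assumes 'subscribers' already contains own_nodeid, as in (b).
NodeBitmap coordinator_ack(NodeBitmap slock, const NodeBitmap &subscribers,
                           std::size_t own_nodeid) {
  slock &= subscribers;     // drop nodes that subscribed too late
  slock.reset(own_nodeid);  // the coordinator's own ack
  return slock;
}

// (c) Applying a participant's ack: intersect with the incoming slock
// instead of copying it, so bits cleared in (a) stay cleared.
void apply_clear_slock(NodeBitmap &pending, const NodeBitmap &incoming) {
  pending &= incoming;
}

// The waiting client can be woken when nobody is left to ack.
bool distribution_done(const NodeBitmap &pending) { return pending.none(); }
```

As an example, take nodes 1, 2 and 3, with node 1 as coordinator and node 3 subscribing too late. After coordinator_ack() only bit 2 is pending. If node 2's ack arrives carrying an slock that still has bit 3 set, a plain copy would make the coordinator wait for node 3 until it times out, whereas the intersect leaves the pending set empty.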