Rconfc state machine

Color legend:
green - States during startup or reelection
pink - Reelection-only states
dark grey - Stopping states

After a successful start, rconfc is in the M0_RCS_IDLE state, waiting for one of two events: a read lock conflict or a user request to stop. These two events are handled only while rconfc is in M0_RCS_IDLE. If either event occurs in any other state, the fact that it happened is recorded, and its handling is deferred until rconfc returns to M0_RCS_IDLE.
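
The deferral rule above can be illustrated with a small model. This is only a sketch under stated assumptions: every name in it (idle_sketch_*, SK_*) is hypothetical and not part of rconfc, the states are reduced to four, and giving a pending stop request priority over a pending conflict is an assumption made for the example.

```c
#include <stdbool.h>

/* Hypothetical, simplified model; these are not the Motr data structures. */
enum idle_sketch_state { SK_STARTING, SK_IDLE, SK_CONFLICT, SK_STOPPING };

struct idle_sketch {
    enum idle_sketch_state st;
    bool conflict_pending; /* a read lock conflict has been observed */
    bool stop_pending;     /* the user has asked rconfc to stop */
};

/* Act on recorded events, but only when the machine is idle. */
void idle_sketch_dispatch(struct idle_sketch *r)
{
    if (r->st != SK_IDLE)
        return; /* handling stays deferred */
    if (r->stop_pending) { /* assumption: stop wins over conflict */
        r->stop_pending = false;
        r->st = SK_STOPPING;
    } else if (r->conflict_pending) {
        r->conflict_pending = false;
        r->st = SK_CONFLICT;
    }
}

/* Record a read lock conflict, then try to handle it. */
void idle_sketch_on_conflict(struct idle_sketch *r)
{
    r->conflict_pending = true;
    idle_sketch_dispatch(r);
}

/* Record a stop request, then try to handle it. */
void idle_sketch_on_stop(struct idle_sketch *r)
{
    r->stop_pending = true;
    idle_sketch_dispatch(r);
}

/* Called whenever the machine arrives at the idle state. */
void idle_sketch_enter_idle(struct idle_sketch *r)
{
    r->st = SK_IDLE;
    idle_sketch_dispatch(r);
}
```

In this model a conflict reported while the machine is in SK_STARTING only sets conflict_pending; the transition to SK_CONFLICT happens in the later idle_sketch_enter_idle() call.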

If a failure occurs that prevents rconfc from functioning properly, rconfc goes to the M0_RCS_FAILURE state. The SM does nothing in this state until the user requests stopping.

Rconfc internal state is protected by the SM group lock. The SM group is provided by the user on rconfc initialisation.

Request cluster entry point from HA

The first stage of rconfc startup is determining the entry point of the Motr cluster whose configuration is to be accessed. The entry point consists of several components, all of which can change during the cluster lifetime.

Cluster entry point includes:

List of confd server fids along with their RPC endpoints.
Fid and RPC endpoint of the active RM creditor that manages concurrent access to the cluster configuration database.
Quorum value: the minimum number of confd servers that must run the same configuration version for that version to be elected.

The HA subsystem is responsible for serving queries for the current cluster entry point. Rconfc queries the HA subsystem through a local HA agent.

Rconfc may fail to elect a version for some reason (e.g. the connection to the active RM cannot be established, or the current set of confds reported by HA does not yield a quorum). In this case rconfc repeats the entry point request to HA and attempts to elect a version with the most recent entry point data set. There is no limit on the number of attempts.

Read Lock Acquisition and Revocation

During m0_rconfc_start() execution rconfc requests a read lock from the Resource Manager (RM) by calling rconfc_read_lock_get(). On request completion rconfc_read_lock_complete() is called. Successful lock acquisition indicates that no configuration change is in progress and that configuration reading is allowed.

The read lock is retained by the rconfc instance until finalisation, but RM can revoke it when a conflicting lock is requested. On lock revocation rconfc_read_lock_conflict() is called. The call installs the m0_confc_gate_ops::go_drain() callback to be notified when the last reading context is detached from the m0_rconfc::rc_confc instance. The callback ends up calling rconfc_gate_drain(), where rconfc starts the conductor cache drain. In rconfc_conductor_drained() rconfc eventually puts the read lock back to RM.

Once informed about the conflict, rconfc disallows configuration reading via m0_rconfc::rc_confc until the next read lock acquisition completes. In addition, in rconfc_conductor_drain() that confc's cache is drained to prevent consumers from reading cached but outdated configuration values. However, the cache data remains untouched and readable until no cache object is pinned anymore and the last reading context has detached from the confc in use.

When done with the cache, m0_rconfc::rc_confc is disconnected from the confd server to prevent unauthorized read operations. The read lock is then returned to RM, complying with the conflict request.

Immediately after revocation rconfc attempts to acquire the read lock again. The lock is granted once the conflicting lock is released.

Version Election and Quorum

In the course of rconfc_read_lock_complete(), if the read lock was acquired successfully, rconfc transitions to the M0_RCS_VERSION_ELECT state. It initialises every confc instance in the m0_rconfc::rc_herd list, attaches rconfc__cb_quorum_test() to its context and initiates asynchronous reading from the corresponding confd server. When version quorum is either reached or found impossible, rconfc_version_elected() is called.

rconfc__cb_quorum_test() is called on every reading event. If the reading context is not yet completed, the function returns zero, indicating that the process should go on. Otherwise rconfc_quorum_test() is called to see whether quorum is reached with the last reply. If quorum is reached or has become impossible, rconfc_version_elected() is called.

Quorum is considered reached when the number of confd servers that reported the same version number is greater than or equal to the value provided to m0_rconfc_init(). If zero was provided, the required quorum is calculated automatically as half of the confd server count plus one.
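
The quorum rule above can be sketched as follows. The function and type names (quorum_sketch_*, SK_Q_*) are hypothetical and are not the rconfc implementation. "Half of the count" is taken as integer division, and "impossible" is interpreted as "even if every outstanding confd reported the currently leading version, the threshold would not be met"; both are assumptions of this example.

```c
#include <stdint.h>

enum quorum_sketch_result { SK_Q_PENDING, SK_Q_REACHED, SK_Q_IMPOSSIBLE };

/* Required number of agreeing confds; zero means "calculate automatically". */
uint32_t quorum_sketch_required(uint32_t configured, uint32_t confd_nr)
{
    return configured != 0 ? configured : confd_nr / 2 + 1;
}

/*
 * vers[0..replied-1] are the versions reported so far by `replied` out of
 * `confd_nr` confd servers (replied <= confd_nr is assumed).
 * *elected is meaningful only when SK_Q_REACHED is returned.
 */
enum quorum_sketch_result
quorum_sketch_test(const uint64_t *vers, uint32_t replied, uint32_t confd_nr,
                   uint32_t configured, uint64_t *elected)
{
    uint32_t need = quorum_sketch_required(configured, confd_nr);
    uint32_t best = 0; /* size of the largest group of equal versions */
    uint32_t i;
    uint32_t j;

    for (i = 0; i < replied; ++i) {
        uint32_t same = 0;

        for (j = 0; j < replied; ++j)
            if (vers[j] == vers[i])
                ++same;
        if (same > best) {
            best = same;
            *elected = vers[i];
        }
    }
    if (best >= need)
        return SK_Q_REACHED;
    if (best + (confd_nr - replied) < need)
        return SK_Q_IMPOSSIBLE;
    return SK_Q_PENDING;
}
```

With three confd servers and no explicit quorum value the threshold is 2, so two matching replies elect a version, while three mutually different replies make quorum impossible.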

If quorum is reached, rconfc_conductor_engage() is called, connecting m0_rconfc::rc_confc to a confd server from the active list. From this moment configuration reading is allowed until the read lock is revoked.

If quorum is not reached, rconfc repeats the request to HA for entry point information and starts a new version election with the most recent entry point data set.

Processing HA notifications

Rconfc is interested in the following notifications from HA:

Permanent failure of active RM creditor.
Permanent failure of one of confd servers from the herd.

In order to receive these notifications rconfc creates a phony confc (m0_rconfc::rc_phony) and, upon receiving the cluster entry point, adds fake objects for the RM creditor service and confd services to it. Using the general non-phony confc instance is not possible, because configuration version election has not been done by that moment.

Actions performed on RM creditor death:

If the read lock is not acquired yet, rconfc restarts the election process from requesting the cluster entry point. The HA subsystem is expected to return information about the newly chosen active RM creditor, or an error if it was unable to choose one. Note that if the HA subsystem constantly returns an already dead RM creditor, rconfc will loop indefinitely.
If the read lock is held by rconfc, RM creditor death is observed by checking the local owner state in rconfc_read_lock_conflict(). The local owner also tracks changes in the RM creditor HA state (see rlock_ctx_creditor_setup()). On RM creditor death the owner goes to the ROS_QUIESCE state and calls conflict callbacks for all held credits. Rconfc unsets the local owner's creditor and restarts the election process in order to receive the newly chosen creditor from HA.

Actions performed on death of confd server from herd:

Drop connection.
Finalise internal confc.
Mark herd link as CONFC_DEAD, so this confd doesn't participate in possible confd switch (see Reconnecting confc to another confd).

Death notification is handled by rconfc_link::rl_fom, which is queued from rconfc_herd_link__on_death_cb(). The FOM is intended to safely disconnect the herd link from the problematic confd when session and connection termination may time out. The FOM prevents the client's locality from being blocked for a noticeably long time.

                                    |  m0_fom_init()
      !m0_confc_is_inited() ||      |  m0_fom_queue()
      !m0_confc_is_online()         V
   +--------------------------- M0_RLF_INIT
   |                                |
   |                                |  wait for M0_RPC_SESSION_IDLE
   |                                V
   +---------------------- M0_RLF_SESS_WAIT_IDLE
   |                                |
   |                                |  m0_rpc_session_terminate()
   |                                V
   +---------------------- M0_RLF_SESS_TERMINATING
   |                                |  m0_rpc_session_fini()
   |                                |  m0_rpc_conn_terminate()
   |                                V
   +---------------------- M0_RLF_CONN_TERMINATING
   |                                |  m0_rpc_conn_fini()
   |                                V
   +--------------------------->M0_RLF_FINI
                                    |  m0_fom_fini()
                                    |  rconfc_herd_link_fini()
                                    V
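
The phase sequence in the diagram above can be written down as a transition function. This is a sketch only: the SK_RLF_* names mirror the diagram but are hypothetical, as is link_fom_sketch_next(). The diagram labels the left-hand edge only at M0_RLF_INIT, so the single bail_out flag used here for every phase is an assumption.

```c
#include <stdbool.h>

/* Hypothetical mirror of the phases shown in the diagram. */
enum link_fom_sketch_phase {
    SK_RLF_INIT,
    SK_RLF_SESS_WAIT_IDLE,
    SK_RLF_SESS_TERMINATING,
    SK_RLF_CONN_TERMINATING,
    SK_RLF_FINI
};

/*
 * Next phase of the link FOM. bail_out models the left-hand edges of the
 * diagram, which lead straight to the final phase.
 */
enum link_fom_sketch_phase
link_fom_sketch_next(enum link_fom_sketch_phase cur, bool bail_out)
{
    if (bail_out)
        return SK_RLF_FINI;
    switch (cur) {
    case SK_RLF_INIT:
        return SK_RLF_SESS_WAIT_IDLE;   /* wait for the session to be idle */
    case SK_RLF_SESS_WAIT_IDLE:
        return SK_RLF_SESS_TERMINATING; /* session termination started */
    case SK_RLF_SESS_TERMINATING:
        return SK_RLF_CONN_TERMINATING; /* connection termination started */
    case SK_RLF_CONN_TERMINATING:
    case SK_RLF_FINI:
        break;
    }
    return SK_RLF_FINI;
}
```

Keeping each waiting step as a separate phase is what lets the FOM yield between steps instead of blocking the locality.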

Attention: Currently HA notification processing does not take the "conductor" confc into account. This confc instance is separate from those used in the herd and is not affected even if the death of the confd server it communicates with is observed. It is assumed that RPC will eventually return an error and the "conductor" confc will be reconnected to another confd.

Note

There are no special HA notifications about changes to the confd server list. For rconfc logic to remain correct in such a case, the following behaviour is expected from HA:

A confd server can't be excluded from the list without a prior HA notification about the permanent failure of this confd server. Rconfc will receive this notification and stop working with that server.
A confd server can't be added to the list without a prior configuration database update that adds this service to the database. Rconfc will observe a read lock conflict and eventually restart the election process, thus obtaining the updated confd list.

Gating confc operations

Blocking confc context initialisation

Rconfc gates read operations conducted through the confc instance it governs, i.e. m0_rconfc::rc_confc. Reading is allowed while rconfc holds the read lock. Before letting a read go on, m0_confc_ctx_init() performs a check by calling the previously set callback m0_confc::cc_gops::go_check(), which in fact is rconfc_gate_check().

While the read lock is revoked, rconfc_gate_check() blocks any m0_confc_ctx_init() call made with this particular m0_rconfc::rc_confc. On the next successful read lock acquisition all previously blocked contexts are unblocked. Once allowed to read, a context can be used as many times as required.
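
The gating behaviour can be modelled without any threading. In this sketch, which uses hypothetical names (gate_sketch_*) throughout, a context initialisation that would block is merely counted rather than put to sleep; the real check waits until reading is allowed again.

```c
#include <stdbool.h>

struct gate_sketch {
    bool     reading_allowed; /* true while the read lock is held */
    unsigned blocked;         /* context initialisations waiting for the lock */
};

/* Stand-in for the go_check() step: true means the caller may read now. */
bool gate_sketch_check(struct gate_sketch *g)
{
    if (g->reading_allowed)
        return true;
    ++g->blocked; /* the real caller would wait here instead of returning */
    return false;
}

/* The read lock has been revoked: further initialisations must wait. */
void gate_sketch_lock_revoked(struct gate_sketch *g)
{
    g->reading_allowed = false;
}

/* The read lock is held again; returns how many waiters were released. */
unsigned gate_sketch_lock_acquired(struct gate_sketch *g)
{
    unsigned released = g->blocked;

    g->reading_allowed = true;
    g->blocked = 0;
    return released;
}
```

The check happens only at context initialisation in this model, which matches the statement that a context, once allowed to read, is not gated again.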

Diag.1: "Reading allowed at the moment of context initialisation"

Diag.2: "Reading disallowed at the moment of context initialisation"

Diag.3: "Reading remains disallowed because of RM communication failure"

Cleaning confc cache data

When a new configuration change is in progress and the read lock is therefore revoked, rconfc_read_lock_conflict() defers cache draining until no reading context is attached. It installs the m0_confc::cc_gops::go_drain() callback, which normally remains NULL and so does not affect the execution of m0_confc_ctx_fini() in any way. With the callback set, at the moment of the very last detach m0_confc_ctx_fini() calls m0_confc::cc_gops::go_drain(), which in fact is rconfc_gate_drain(), where cache cleanup is finally invoked by setting the M0_RCS_CONDUCTOR_DRAIN state. Rconfc SM remains in the M0_RCS_CONDUCTOR_DRAIN_CHECK state until all conf objects are unpinned. Once there are no pinned objects, rconfc cleans the cache, puts the read lock and starts the reelection process.

Diag.4: "Deferred Cache Cleanup"

Note: Forced cache draining occurs only when the m0_confc_gate_ops::go_drain callback is installed, which happens only when reading is not allowed. Normally the callback is NULL, and therefore the confc cache remains unaffected during m0_confc_ctx_fini().
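
The deferred cleanup described in this section amounts to a callback that fires on the last detach. A minimal sketch, with hypothetical names (drain_sketch_*) and a flag standing in for the actual drain; starting the drain immediately when no context is attached at conflict time is an assumption of the example, not something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct drain_sketch_confc;
typedef void (*drain_sketch_cb)(struct drain_sketch_confc *);

struct drain_sketch_confc {
    unsigned        ctx_nr;        /* reading contexts currently attached */
    drain_sketch_cb go_drain;      /* NULL unless a drain has been requested */
    bool            drain_started; /* stands in for the actual cache drain */
};

/* Stand-in for the gate drain step. */
void drain_sketch_gate_drain(struct drain_sketch_confc *c)
{
    c->drain_started = true;
}

/* On read lock conflict: request a drain, deferred to the last detach. */
void drain_sketch_on_conflict(struct drain_sketch_confc *c)
{
    c->go_drain = drain_sketch_gate_drain;
    if (c->ctx_nr == 0) /* assumption: nothing attached, drain right away */
        c->go_drain(c);
}

/* Context detach: only the last one triggers the installed callback. */
void drain_sketch_ctx_fini(struct drain_sketch_confc *c)
{
    assert(c->ctx_nr > 0);
    if (--c->ctx_nr == 0 && c->go_drain != NULL)
        c->go_drain(c);
}
```

While go_drain is NULL, drain_sketch_ctx_fini() only decrements the counter, which corresponds to the normal case where the cache is left untouched.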

Reconnecting confc to another confd

If configuration reading fails because of a network error, the confc context requests the confc to skip its current connection to confd and switch to another confd server running the same version. This is done inside the state machine in the S_SKIP_CONFD state by calling the callback m0_confc::cc_gops::go_skip(), which in fact is rconfc_gate_skip(). The function iterates through the m0_rconfc::rc_active list and returns on the first successfully established connection. If no connection succeeds, the function returns -ENOENT, making the state machine end in the S_FAILURE state.
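
The iteration over the active list can be sketched as below. All names (skip_sketch_*) are hypothetical. A boolean field stands in for "the connection attempt succeeds", and passing over the current confd and any link marked dead is an assumption based on the herd link handling described earlier.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct skip_sketch_link {
    bool dead;      /* marked dead after HA reported permanent failure */
    bool reachable; /* stand-in for "connection attempt succeeds" */
};

/*
 * Returns the index of the first confd in the active list that can be
 * connected to, or -ENOENT when no candidate works.
 */
int skip_sketch_next_confd(const struct skip_sketch_link *active, size_t nr,
                           size_t current)
{
    size_t i;

    for (i = 0; i < nr; ++i) {
        if (i == current || active[i].dead)
            continue; /* not a candidate for the switch */
        if (active[i].reachable)
            return (int)i; /* first successful connection wins */
    }
    return -ENOENT;
}
```

A negative return value corresponds to the case where the caller's state machine ends in its failure state.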

Note: As long as confc is switched to a confd of the same version number, the cache data remains valid and needs no special attention.

Cleaning configuration cache during stopping

When rconfc is stopping, it scans the configuration for pinned objects (i.e. objects with m0_conf_obj::co_nrefs > 0). If such an object is found, rconfc waits until it is unpinned by a configuration consumer. The consumer must be subscribed to the m0_reqh::rh_confc_cache_expired chan and put its pinned objects in the callback registered with this chan. When all configuration objects are unpinned, rconfc is able to clean the configuration cache and go to the M0_RCS_FINAL state.
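
The scan for pinned objects can be illustrated on a flat array. The structure and function names here (stop_sketch_*) are hypothetical; the only detail taken from the text is that an object counts as pinned while its reference counter is greater than zero.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cached configuration object. */
struct stop_sketch_obj {
    uint64_t nrefs; /* greater than zero while a consumer pins the object */
};

/* Number of objects that are still pinned. */
size_t stop_sketch_pinned_nr(const struct stop_sketch_obj *objs, size_t nr)
{
    size_t pinned = 0;
    size_t i;

    for (i = 0; i < nr; ++i)
        if (objs[i].nrefs > 0)
            ++pinned;
    return pinned;
}

/* The cache can be cleaned only when nothing is pinned anymore. */
bool stop_sketch_can_finalise(const struct stop_sketch_obj *objs, size_t nr)
{
    return stop_sketch_pinned_nr(objs, nr) == 0;
}
```

In this model the stopping step would be retried after each consumer callback has put its objects, until stop_sketch_can_finalise() returns true.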