Motr M0
DIX copy machine DLD

Overview

This module implements the DIX copy machine using the generic copy machine infrastructure. The DIX copy machine is built upon the request handler service. The same DIX copy machine can be configured to perform multiple tasks, repair and rebalance, using the parity de-clustering layout. The DIX copy machine is typically started during Motr process startup, although it can also be started later.


Definitions

Please refer to the "Definitions" section in "HLD of copy machine and agents" (see References).


Requirements

  • r.dix.cm.aggregation.group Aggregation groups should be implemented so that they require minimal changes to the common framework.
  • r.dix.cm.data.next The implementation should efficiently select the next data to be processed without causing any deadlock or bottleneck.
  • r.dix.cm.report.progress The implementation should efficiently report the overall progress of the data restructuring and update the corresponding layout information for the restructured objects.
  • r.dix.cm.repair.trigger For repair, the DIX copy machine should respond to triggers caused by various kinds of failures.
  • r.dix.cm.repair.iter For repair, the DIX copy machine iterator should iterate over parity group units on the surviving component catalogues and accordingly write the lost data to the spare units of the corresponding parity group.
  • r.dix.cm.rebalance.iter For rebalance, the DIX copy machine iterator should iterate over the spare units of the repaired parity groups and copy the data from the corresponding spare units to the target unit on the new device.
  • r.dix.cm.be.btree The implementation should work with the BE btree directly, without calling CAS service routines.

Dependencies

  • r.dix.cm.resources.manage It must be possible to efficiently manage and throttle resources.

    Please refer to the "Dependencies" section in "HLD of copy machine and agents" and "HLD of SNS Repair" (see References).


Design Highlights

  • The DIX copy machine uses the request handler service infrastructure.
  • The DIX copy machine specific data structure embeds the generic copy machine and other DIX repair specific objects (see the sketch after this list).
  • The DIX copy machine defines its own aggregation group data structure, which embeds the generic aggregation group.
  • Once initialised, the DIX copy machine remains idle until a failure is reported.
  • The DIX copy machine creates copy packets and allocates buffers dynamically, without using any pools.
  • A failure triggers the DIX copy machine to start the repair operation.
  • For multiple nodes, the DIX copy machine maintains a local proxy of every other remote replica in the cluster.
  • For multiple nodes, the DIX copy machine sets infinite bounds for its sliding window and communicates them to the other replicas, identified by the local proxies, through READY FOPs.
  • Every node that serves a data-containing unit of a parity group (spare units can contain data as well) has enough information to restore the parity by sending the served unit. To make this process deterministic, the following rule can be used: the node that serves the data-containing unit with the lowest number within the parity group is responsible for restoring the parity.
  • No data transformation is needed by the DIX repair/re-balance processes.
  • Aggregation groups are not really needed by the DIX repair/re-balance processes and can be implemented in a rudimentary way, which allows the generic copy machine code to be used without significant modification.
    Todo:
    It would be nice to find a solution in which no aggregation group implementation is necessary at all.
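
For illustration, a minimal sketch of the embedding described above; the field names are illustrative only, the actual definitions live in the DIX copy machine sources:

  /* DIX copy machine: embeds the generic copy machine, so that the
   * generic infrastructure can drive it while DIX code recovers the
   * enclosing structure via container_of(). */
  struct m0_dix_cm {
          struct m0_cm          dcm_base; /* generic copy machine */
          struct m0_dix_cm_iter dcm_it;   /* DIX data iterator, see
                                           * m0_dix_cm_iter_start() */
  };

  /* DIX aggregation group: embeds the generic aggregation group. */
  struct m0_dix_cm_ag {
          struct m0_cm_aggr_group dag_base;
  };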

Logical specification

Component overview

The focus of the DIX copy machine is to efficiently restructure (repair or re-balance) data in case of failures, viz. device, node, etc. The restructuring operation is split into various copy packet phases.

Copy machine setup

The DIX copy machine service allocates and initialises the corresponding copy machine. After cm_setup() is successfully called, the copy machine transitions to the M0_CMS_IDLE state and waits until a failure happens. As mentioned in the HLD, failure information is broadcast to all the replicas in the cluster using a TRIGGER FOP. The FOM corresponding to the TRIGGER FOP activates the DIX copy machine to start the repair operation by invoking m0_cm_start(); this invokes the DIX copy machine specific start routine, which initialises the specific data structures.
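
A condensed sketch of this activation path, with error handling simplified; trigger_fom_tick(), dix_trigger_fom2cm() and TRIGGER_STARTED are hypothetical names, while m0_cm_start() is the generic interface named above:

  static int trigger_fom_tick(struct m0_fom *fom)
  {
          /* Hypothetical helper locating the copy machine to activate. */
          struct m0_cm *cm = dix_trigger_fom2cm(fom);
          int           rc;

          /* The copy machine is in M0_CMS_IDLE; m0_cm_start() invokes
           * the DIX-specific start routine via the operations vector. */
          rc = m0_cm_start(cm);
          /* Proceed on success, fail the FOM otherwise. */
          m0_fom_phase_moveif(fom, rc, TRIGGER_STARTED, M0_FOPH_FAILURE);
          return M0_FSO_AGAIN;
  }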

Once the repair operation is complete, the same copy machine is used to perform the re-balance operation. In the re-balance operation the data from the data-containing unit with the lowest index in the repaired parity group is copied to the new device using the layout.

Copy machine ready

Once the copy machine is initialised, it is ready to start the repair/re-balance process.

Copy machine startup

The startup phase starts and initialises the DIX copy machine data iterator.

See also
m0_dix_cm_iter_start()

Copy machine data iterator

The DIX copy machine implements an iterator to efficiently select the next data to process. This is done by implementing the copy machine specific operation m0_cm_ops::cmo_data_next(). The following pseudo-code illustrates the DIX data iterator for the repair as well as the re-balance operation:

- for each component catalogue C in ctidx
  - fetch layout L for C
  // proceed in local key order (keys belong to C)
  - for each local key I belonging to C
    // determine whether group S that I belongs to needs reconstruction
    - if no device id of S is in the failure set, continue to the next key
    // the group has to be reconstructed; check whether the unit that
    // contains data and has the lowest number in group S is served locally
    - for each data, parity and spare unit U in S (0 <= U < N + 2K)
      - if U is local && U contains data && U's number is minimal
        - determine the destination

The above iterator iterates through each component catalogue in the catalogue-index catalogue in record key order and determines whether the corresponding parity group needs reconstruction. If it does, the iterator checks whether the local node serves the unit that contains data and has the lowest index within the parity group; if so, this node is responsible for the data reconstruction. After that the destination is determined and a copy packet is created.
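
The iterator is plugged into the generic framework through the copy machine operations vector; a minimal sketch, where dix_cm_data_next() is an illustrative name for the DIX-specific implementation:

  /* DIX-specific data-selection hook: fills the copy packet with the
   * next piece of data to be restructured, as per the pseudo-code above. */
  static int dix_cm_data_next(struct m0_cm *cm, struct m0_cm_cp *cp);

  /* Operations vector registered with the generic copy machine. */
  static const struct m0_cm_ops dix_cm_ops = {
          .cmo_data_next = dix_cm_data_next,
          /* ... remaining callbacks, some implemented as stubs ... */
  };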

Copy machine sliding window

The DIX copy machine supports only an infinite sliding window that does not need to be maintained, for the following reasons:

  • DIX repair/rebalance processes involve no data transformation, so there is no need to regulate the data reconstruction process using a sliding window.
  • Aggregation group IDs cannot be determined and ordered for distributed indices, so a sliding window cannot be applied.

As a consequence, some mandatory callbacks can be implemented as stubs, as sketched below.
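
For illustration, a sketch of such a stub, assuming the generic operations vector exposes an aggregation-group-advance hook with a signature like the one below (the hook name and signature are an assumption here):

  /* With an infinite sliding window, aggregation group IDs of distributed
   * indices cannot be enumerated in order, so advancing to the "next"
   * group is meaningless; the stub reports that there is none. */
  static int dix_cm_ag_next(struct m0_cm *cm,
                            const struct m0_cm_ag_id *id_curr,
                            struct m0_cm_ag_id *id_next)
  {
          return -ENODATA;
  }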

Copy machine stop

Once all the component objects corresponding to the distributed indices belonging to the failure set are restructured (repaired or re-balanced) successfully by every replica in the cluster, the restructuring operation is marked complete.

Threading and Concurrency Model

The DIX copy machine is implemented as a request handler service; thus it shares the request handler threading model and does not create its own threads. All the copy machine operations are performed in the context of request handler threads.

The DIX copy machine uses the generic copy machine infrastructure, which implements the copy machine state machine using the generic Motr state machine infrastructure.

Locking

All updates to the members of the copy machine are done with m0_cm_lock() held.
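
For example, m0_cm_lock() and m0_cm_unlock() are the generic copy machine locking interfaces:

  m0_cm_lock(cm);
  /* ... update copy machine members under the lock ... */
  m0_cm_unlock(cm);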

NUMA optimizations

N/A


Conformance

  • i.dix.cm.aggregation.group Aggregation groups are implemented in a rudimentary way, requiring only minimal changes to the common framework.
  • i.dix.cm.data.next The DIX copy machine implements the data-next operation using the catalogue-index catalogue iterator and the pdclust layout infrastructure to select the next data to be repaired from the failure set. This is done in component catalogue fid order.
  • i.dix.cm.report.progress The implementation efficiently reports the overall progress of the data restructuring and updates the corresponding layout information for the restructured objects.
  • i.dix.cm.repair.trigger Various failures are reported through TRIGGER FOPs, which create corresponding FOMs. The FOMs invoke DIX-specific copy machine operations through the generic copy machine interfaces, which cause the copy machine state transitions.
  • i.dix.cm.repair.iter For repair, the DIX copy machine iterator iterates over record keys and determines whether the surviving data-containing unit with the lowest index within the parity group is served locally, so that it can be used for the lost data reconstruction.
  • i.dix.cm.rebalance.iter For rebalance, the DIX copy machine iterator acts like the repair iterator and copies the data to the corresponding target units on the new device.
  • i.dix.cm.be.btree The DIX copy machine calls BE btree interfaces directly, without calling CAS service routines.

Unit tests

N/A


System tests

N/A


Analysis

N/A


References

Following are the references to the documents from which the design is derived. For documentation links, please refer to doc/motr-design-doc-list.rst.

  • Copy Machine redesign
  • HLD of copy machine and agents
  • HLD of SNS Repair