Motr  M0
Copy Packet DLD

Overview

A copy packet is the data structure that describes the movement of a piece of restructured data between copy machine replica nodes and within a single replica. It is an entity that has both data and operations. Copy packets are FOMs of a special type, created when a data restructuring request is posted to a replica.

Copy packet processing logic is implemented in a non-blocking way. A packet has buffers to carry data and a FOM for execution in the context of a request handler. The work it performs depends on its current phase (i.e. FOM phase).


Definitions

  • Copy Packet: A chunk of data traversing through the copy machine.
  • Copy packet acknowledgement: Reply received, representing successful processing of the copy packet. With this acknowledgement, copy packet releases various resources and updates its internal state.
  • Next phase function: Given a copy packet, this identifies the phase that has to be assigned to this copy packet. The next phase function (m0_cm_cp_ops::co_phase_next()) determines the routing and execution of copy packets through the copy machine.

Requirements

  • r.cm.cp The copy packet abstraction should be implemented such that it represents the data to be transferred within a replica.
  • r.cm.cp.async Every read-write (receive-send) by replica should follow the non-blocking processing model of Motr design.
  • r.cm.buffer_pool Copy machine should provide a buffer pool, which is efficiently used for copy packet data.
  • r.cm.cp.bulk_transfer All data packets (except control packets) that are sent over RPC should use bulk-interface for communication.
  • r.cm.cp.fom.locality Copy packet FOMs should be efficiently assigned request handler locality without causing any deadlock or data corruption.
  • r.cm.addb Copy packet should have its own ADDB context (similar to a FOM). Although it uses different ADDB locations, this context traces the entire path of the copy packet.

Dependencies

  • r.cm.service Copy packet FOMs are executed in context of copy machine replica.
  • r.cm.ops Replica provides operations to create, configure and execute copy packet FOMs.
  • r.layout Data restructuring needs layout info.
  • r.layout.input-iterator Iterate over layout info to create packets and forward them within the replica.
  • r.resource Resources like buffers, CPU cycles, network bandwidth, storage bandwidth are needed by copy packet FOM during execution.
  • r.confc Data from configuration will be used to initialise copy packets.

Design Highlights

  • Copy packet is implemented as a FOM, which inherently follows the non-blocking model of Motr.
  • Distributed sliding window algorithm is used to process copy packets within copy machine replica.
  • Layout is updated periodically as the restructuring progresses.

Logical Specification

Component Overview

Copy packet functionality is split into two parts:

  • generic functionality, implemented by cm/cp.[hc] and
  • copy packet type functionality which is based on copy machine type. (e.g. SNS, Replication, &c).

Copy packet creation: Given the size of the buffer pool, the replica calculates its initial sliding window (see m0_cm_sw). Once the replica learns the windows of every other replica, it can produce copy packets that replicas (including this one) are ready to process.

A copy packet is created when:

  • the replica starts. Enough packets should be created at start to keep the sliding window supplied with packets for processing.
  • the sliding window has space. After each copy packet completes, the sliding window is checked for space. If space exists, new copy packets are created.
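
The two creation triggers above can be illustrated with a minimal sketch. It assumes the sliding window can be reduced to a capacity and an in-flight count; `struct sketch_sw` and both functions are hypothetical names for illustration and are not the real m0_cm_sw interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a replica's sliding window; not m0_cm_sw. */
struct sketch_sw {
	size_t sw_capacity;  /* packets the window can hold at once */
	size_t sw_in_flight; /* packets created and not yet finalised */
};

/*
 * Create packets until the window is full or no work is left.
 * *pending is the amount of work not yet turned into packets.
 * Returns the number of packets created by this call.
 */
static size_t sketch_sw_fill(struct sketch_sw *sw, size_t *pending)
{
	size_t created = 0;

	assert(sw != NULL && pending != NULL);
	assert(sw->sw_in_flight <= sw->sw_capacity);
	while (sw->sw_in_flight < sw->sw_capacity && *pending > 0) {
		++sw->sw_in_flight;
		--*pending;
		++created;
	}
	return created;
}

/* A packet completed: free its slot, then check the window for space. */
static size_t sketch_sw_cp_done(struct sketch_sw *sw, size_t *pending)
{
	assert(sw != NULL && sw->sw_in_flight > 0);
	--sw->sw_in_flight;
	return sketch_sw_fill(sw, pending);
}
```

In this sketch, `sketch_sw_fill()` corresponds to the creation done at replica start and `sketch_sw_cp_done()` to the check made after each packet completes. Learning the windows of other replicas is left out.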

Copy packet destruction: A copy packet is destroyed by setting its phase to M0_CCP_FINI. The following are some cases where a copy packet is finalised.

  • On notification of copy packet data written to device/container.
  • During transformation, packets that are no longer needed are finalised.
  • On completion of copy packet transfer over the network.

Copy packet cooperation within replica: A copy packet needs resources (memory, processor, &c.) to do its processing:

  • Needs buffers to keep data during IO.
  • Needs buffers to keep data until the transfer is finished.
  • Needs buffers to keep intermediate checksum until all units of an aggregation group have been received.

The copy packet (and its associated buffers) goes through various phases. In one scenario, a copy packet is created when data is read from a device. The packet then transitions to the data transformation phase, which reconstructs the data, and then to the data write or send phase, which submits IO. On IO completion, the copy packet is destroyed.

The copy machine provides and manages the resources required by copy packets. For example, in SNS Repair the copy machine creates two buffer pools, one for incoming and one for outgoing copy packets. New copy packets are created based on the availability of buffers in these pools. On finalisation of a copy packet, its buffers are released back to the respective buffer pool.
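
As a rough illustration of the acquire-on-create and release-on-finalise pattern described above, the sketch below models each pool as a plain counter. `struct sketch_pool`, `struct sketch_cm` and the two functions are hypothetical names; a real pool would hand out network buffers and handle locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counting pool; stands in for a real buffer pool. */
struct sketch_pool {
	size_t sp_free;  /* buffers currently available */
	size_t sp_total; /* buffers owned by the pool */
};

/* Hypothetical copy machine with separate incoming and outgoing pools. */
struct sketch_cm {
	struct sketch_pool scm_in;
	struct sketch_pool scm_out;
};

/* Take nr buffers for a new packet, or none at all if too few are free. */
static bool sketch_pool_get(struct sketch_pool *p, size_t nr)
{
	assert(p != NULL && p->sp_free <= p->sp_total);
	if (p->sp_free < nr)
		return false; /* caller must not create the packet yet */
	p->sp_free -= nr;
	return true;
}

/* Return nr buffers to the pool when a packet is finalised. */
static void sketch_pool_put(struct sketch_pool *p, size_t nr)
{
	assert(p != NULL && p->sp_free + nr <= p->sp_total);
	p->sp_free += nr;
}
```

Keeping the two pools separate means that, in this sketch, a burst of outgoing packets cannot exhaust the buffers needed to receive incoming ones.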

State Specification

A copy packet is a state machine that goes through the following phases:

  • INIT Copy packet gets initialised with input data. E.g., in SNS, extent, COB, &c. get initialised. Usually this is done with some iterator over layout info. (m0_cm_cp_phase::M0_CCP_INIT)
  • READ Reads data from its associated container or device according to the input information, and places the data in a copy packet data buffer. Before doing this, it needs to grab necessary resources: memory, locks, permissions, CPU/disk bandwidth, etc. Data/parity is encapsulated in the copy packet, and the copy packet is transferred to the next phase. (m0_cm_cp_phase::M0_CCP_READ)
  • WRITE Writes data from copy packet data buffer to the container or device. Spare container and offset to write is identified from layout information. (m0_cm_cp_phase::M0_CCP_WRITE)
  • XFORM Data restructuring is done in this phase. This phase would typically process a lot of local copy packets. E.g., for SNS repair machine, a file typically has a component object (cob) on each device in the pool, which means that a node could (and should) calculate "partial parity" of all local units, instead of sending each of them separately across the network to a remote copy machine replica. (m0_cm_cp_phase::M0_CCP_XFORM)
  • IOWAIT Waits for IO to complete. (m0_cm_cp_phase::M0_CCP_IO_WAIT)
  • SW_CHECK Checks if the copy packet is in the sliding window. If it is not, the packet waits in this phase until it fits in the sliding window.
  • SEND Send copy packet over network. Control FOP and bulk transfer are used for sending copy packet. (m0_cm_cp_phase::M0_CCP_SEND)
  • SEND_WAIT Waits until an acknowledgement is received that the copy packet has reached the destination.
  • BUF_ACQ Acquire the buffers based on the control fop information.
  • RECV_INIT After acquiring required number of buffers, copy packet FOM transitions to m0_cm_cp_phase::M0_CCP_RECV_INIT phase and initiates zero copy using rpc_bulk.
  • RECV_WAIT Zero copy is completed. Any cleanup, if required, is done in this phase.
  • FINI Finalises copy packet.

A specific copy packet type can have phases in addition to these. Additional phases may be used for processing that is specific to that copy packet type. Additional phases can also be handled by the next phase function, because the implementation of the next phase function is itself specific to the copy packet type.

Transitions between standard phases are made by the next phase function. It produces the next phase according to the configuration of the copy machine and the copy packet itself.
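
A next phase function can be pictured as a pure mapping from the current phase and packet state to the following phase. The sketch below is illustrative only: the `SK_` enum mirrors the phase list above but is not the real m0_cm_cp_phase, the two flags in `struct sketch_cp` are hypothetical, and the transitions are assumptions drawn from the read, transform, then write-or-send scenario rather than from the state diagram.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the phases listed above. */
enum sketch_phase {
	SK_INIT, SK_READ, SK_WRITE, SK_XFORM, SK_IO_WAIT, SK_SW_CHECK,
	SK_SEND, SK_SEND_WAIT, SK_BUF_ACQ, SK_RECV_INIT, SK_RECV_WAIT,
	SK_FINI
};

/* Hypothetical packet state that the decision depends on. */
struct sketch_cp {
	enum sketch_phase scp_phase;
	bool              scp_io_is_write; /* IO being waited for is a write */
	bool              scp_remote;      /* destination is another replica */
};

/* Return the phase that follows the current one (assumed transitions). */
static enum sketch_phase sketch_phase_next(const struct sketch_cp *cp)
{
	assert(cp != NULL);
	switch (cp->scp_phase) {
	case SK_INIT:      return SK_READ;
	case SK_READ:      return SK_IO_WAIT;
	case SK_WRITE:     return SK_IO_WAIT;
	case SK_IO_WAIT:   return cp->scp_io_is_write ? SK_FINI : SK_XFORM;
	case SK_XFORM:     return cp->scp_remote ? SK_SW_CHECK : SK_WRITE;
	case SK_SW_CHECK:  return SK_SEND;
	case SK_SEND:      return SK_SEND_WAIT;
	case SK_SEND_WAIT: return SK_FINI;
	case SK_BUF_ACQ:   return SK_RECV_INIT;
	case SK_RECV_INIT: return SK_RECV_WAIT;
	case SK_RECV_WAIT: return SK_XFORM;
	case SK_FINI:      return SK_FINI;
	}
	return SK_FINI; /* not reached for valid phases */
}
```

Because the function only reads packet state, a copy packet type with additional phases could supply its own mapping of the same shape without changing the generic phase driver.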

State diagram for copy packet:

[Figure: copy packet state diagram]

Transformation in multinode environment

When a copy packet FOM enters the transformation phase, it calculates partial parity on that particular node. This calculation is based on the incoming copy packets for an aggregation group. In case of multinode data restructuring, transformation is executed locally, i.e. along the outgoing path as well as along the incoming path.

Outgoing path

The transformed copy packet contains the partial parity of the local copy packets belonging to a particular aggregation group. The transformed copy packet can either be written locally or be sent to a remote destination node.
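
To make "partial parity" concrete, the sketch below folds local units into one accumulator buffer with XOR. This is a deliberate simplification: XOR covers only the single-parity case and is a stand-in for whatever parity algorithm the copy machine type actually uses. `sketch_xor_accumulate()` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fold one local unit into the accumulator: acc[i] ^= unit[i].
 * Calling this once per local unit leaves the partial parity of all
 * folded units in acc, so only acc needs to cross the network.
 */
static void sketch_xor_accumulate(uint8_t *acc, const uint8_t *unit,
				  size_t len)
{
	size_t i;

	assert(acc != NULL && unit != NULL);
	for (i = 0; i < len; ++i)
		acc[i] ^= unit[i];
}
```

Since XOR is its own inverse, folding a unit into the accumulator a second time removes it again. That property is what makes it possible to combine partial results from several nodes in any order.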

Incoming path

The transformed copy packet contains the partial parity of the copy packets received from other nodes as well as of local copy packets. This is typically executed on the destination node (i.e. the node on which spare units are allocated, in case of a repair operation). The transformation phase function inherently waits for all the copy packets in an aggregation group to be transformed. For this to happen, the transformation function has to keep track of the following information:

  • number of copy packets that have been transformed for a particular aggregation group (m0_cm_aggr_group::cag_transformed_cp_nr).
  • indices of the copy packets in an aggregation group that have been transformed (this knowledge is required by parity recovery algorithms like Reed-Solomon). This is stored using a bitmap (m0_cm_cp::c_xform_cp_indices).

The index of the copy packet in an aggregation group is stored by the iterator in m0_cm_cp::c_ag_cp_idx. This index is used by the transformation function to populate the bitmap (m0_cm_cp::c_xform_cp_indices). Note: This index should be the global index of a copy packet in an aggregation group.

For any aggregation group, transformation is marked as complete iff all indices in the bitmap are set to true.
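
The bookkeeping and the completion rule can be sketched with a single 64-bit word as the bitmap. For brevity this sketch keeps the counter and the bitmap in one hypothetical `struct sketch_ag`, although the design above stores them in m0_cm_aggr_group and m0_cm_cp respectively. It also assumes at most 64 copy packets per aggregation group. All names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-group bookkeeping; at most 64 packets per group. */
struct sketch_ag {
	uint64_t sag_xform_indices;  /* bit i set: packet i transformed */
	uint32_t sag_transformed_nr; /* number of bits set so far */
	uint32_t sag_cp_nr;          /* packets expected in the group */
};

/* Record that the packet with global index idx has been transformed. */
static void sketch_ag_mark(struct sketch_ag *ag, uint32_t idx)
{
	uint64_t bit;

	assert(ag != NULL && ag->sag_cp_nr <= 64 && idx < ag->sag_cp_nr);
	bit = UINT64_C(1) << idx;
	if ((ag->sag_xform_indices & bit) == 0) {
		/* Count each index once, even if it is marked again. */
		ag->sag_xform_indices |= bit;
		++ag->sag_transformed_nr;
	}
}

/* Transformation is complete iff every expected index is set. */
static bool sketch_ag_is_complete(const struct sketch_ag *ag)
{
	uint64_t all;

	assert(ag != NULL && ag->sag_cp_nr > 0 && ag->sag_cp_nr <= 64);
	/* Shifting a 64-bit value by 64 is undefined, so handle it apart. */
	all = ag->sag_cp_nr == 64 ? UINT64_MAX
				  : (UINT64_C(1) << ag->sag_cp_nr) - 1;
	return ag->sag_xform_indices == all;
}
```

Using the global index matters here: if two nodes numbered their packets independently, their bits would collide and the group could appear complete too early or never complete.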

Threading and Concurrency Model

A copy packet is implemented as a FOM and thus does not have its own thread. It runs in the context of reqh threads. The FOM locality group lock (i.e. m0_cm_cp::c_fom::fo_loc::fl_group::s_lock) is therefore used to serialise access to m0_cm_cp and its operations.


Conformance

  • i.cm.cp Replicas communicate using copy packet structure.
  • i.cm.cp.async Copy packets are implemented as FOMs. Running as a FOM in the request handler infrastructure makes processing non-blocking.
  • i.cm.buffer_pool Buffer pools are managed by copy machine which cater to the requirements of copy packet data.
  • i.cm.cp.bulk_transfer All data packets (except control packets) that are sent over RPC, use bulk-interface for communication.
  • i.cm.cp.fom.locality Copy machine implements its type specific m0_cm_cp_ops::co_home_loc_helper().
  • i.cm.cp.addb Copy packet uses the ADDB context of the copy machine.

Unit Tests

  • Basic test: alloc, init, fini and free.
  • Test storage phases (write, read and then verify).
  • Test transformation phase. Wait in the transformation phase till the bitmap in the transformed copy packet has all its bits set to true.

System Tests


References

For documentation links, please refer to this file : doc/motr-design-doc-list.rst

  • HLD of SNS Repair
  • HLD of Copy machine and agents