Motr  M0
DLD of Bulk Server

Overview

This document contains the detailed level design of the Bulk I/O Service.

Purpose
The purpose of this document is to:

  • Refine higher-level designs
  • Be verified by inspectors and architects
  • Guide the coding phase

Definitions

Terms used in this document are defined below:

  • Bulk I/O Service: The Motr ioservice, which processes read/write FOPs.
  • FOP: File operation packet, a description of a file operation suitable for sending over the network or storing on a storage device. A FOP identifies the file operation type and the operation parameters.
  • FOM: FOP state machine, a state machine that represents the current state of the FOP's execution on a node. A FOM is associated with a particular FOP and implicitly includes this FOP as part of its state.
  • zero-copy: A copy between a source and a destination that takes place without any intermediate copies to staging areas.
  • STOB: Storage object, a basic M0 data structure containing raw data.
  • COB: Component object, a component (stripe) of a file that references a single storage object and contains metadata describing the object.
  • rpc_bulk: Generic interface for zero-copy.
  • buffer_pool: Pre-allocated and pre-registered pool of buffers. The buffer pool also provides interfaces to get and put buffers. Every Bulk I/O Service initializes its own buffer_pool.
  • Configuration cache: Configuration data stored in the node’s memory.

Requirements

  • r.bulkserver.async : Bulk I/O server runs asynchronously.
  • r.non-blocking.few-threads : Motr service should use a relatively small number of threads: a few per processor.
  • r.non-blocking.easy : Non-blocking infrastructure should be easy to use and non-intrusive.
  • r.non-blocking.extensibility : Addition of new "cross-cut" functionality (e.g., logging, reporting) potentially including blocking points and affecting multiple FOP types should not require extensive changes to the data structures for each FOP type involved.
  • r.non-blocking.network : Network communication must not block handler threads.
  • r.non-blocking.storage : Storage transfers must not block handler threads.
  • r.non-blocking.resources : Resource acquisition and release must not block handler threads.
  • r.non-blocking.other-block : Other potentially blocking conditions (page faults, memory allocations, writing trace records, etc.) must never block all service threads.

Design Overview

The Bulk I/O Service is provided as a state machine that processes bulk I/O requests. It uses the generic rpc_bulk interface to drive the transport layer's zero-copy RDMA mechanism, which copies data from source to destination. It also uses the STOB I/O interface to complete the I/O operation.

The Bulk I/O Service implements I/O FOMs to process I/O FOPs; see "Fop State Machines for IO FOPs" in the References section.

The Bulk I/O Service interface m0_ioservice_fop_init() registers and initializes the I/O FOP types of the service, that is, the FOP types for the read and write operations described in this document.

The Bulk I/O Service initializes its buffer_pool during service initialization. It gets the buffers it needs from the buffer_pool and passes them to rpc_bulk for zero-copy. The buffers are returned to the buffer pool after the data has been written to the STOB.

The request handler initializes the Bulk I/O Service during its startup.


Logical Specification

Sequence diagram

This section describes how the client and server communicate while processing read/write FOPs. It also shows how zero-copy is used in I/O FOP processing.

Write operation with zero-copy data transfer

(Message sequence chart msc_inline_mscgraph_6: write operation with zero-copy data transfer)

  • The client sends a write FOP to the server. The write FOP contains the network buffer descriptor list and the indexvec list instead of the actual data.
  • To process the write FOP, the request handler creates and initializes a write FOM and puts it into the run queue for execution.
  • The state transition function goes through the generic phases and the extended phases m0_io_fom_cob_rw_phases defined for the I/O FOM (write FOM):
    • Gets as many buffers as it can from the buffer_pool to transfer data for all descriptors. If the buffer_pool has too few buffers to process all descriptors, the FOM proceeds batch by batch. At least one buffer is needed to start a bulk transfer. If no buffer is available, the Bulk I/O Service waits until the buffer_pool becomes non-empty.
    • Initiates zero-copy using rpc_bulk on the acquired buffers and waits for zero-copy to complete for all descriptors on which it was initiated.
    • Zero-copy completes.
    • Initiates the write of data to the STOB for all indexvecs and waits for the STOB I/O to complete.
    • STOB I/O completes.
    • Returns buffers to the buffer_pool if it holds more buffers than there are remaining descriptors.
  • Enqueues the response in fo_rep_fop for the request handler to send back to the client.
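
The batch-by-batch behaviour in the steps above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the Motr implementation: the function sketch_write_batches() and its parameters are hypothetical, one buffer is assumed to serve exactly one descriptor, and waiting on an empty pool is reduced to an error return.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the batching loop. `ndesc` descriptors have to be
 * transferred and `*pool_free` buffers are free in the pool. This sketch
 * assumes that one buffer serves exactly one descriptor.
 *
 * Returns the number of batches, or -1 when a batch cannot start because
 * no buffer is available (the real FOM would wait instead of failing).
 * On return `*pool_free` holds the number of free buffers in the pool.
 */
static int sketch_write_batches(size_t ndesc, size_t *pool_free)
{
        size_t remaining = ndesc; /* descriptors not yet written */
        size_t held      = 0;     /* buffers currently owned by the FOM */
        int    batches   = 0;

        while (remaining > 0) {
                /* Acquire as many buffers as possible, never more than needed. */
                size_t want = remaining - held;
                size_t take = *pool_free < want ? *pool_free : want;

                *pool_free -= take;
                held       += take;
                if (held == 0)
                        return -1; /* at least one buffer is required */

                /* Zero-copy into `held` buffers, then write them to the STOB. */
                assert(held <= remaining);
                remaining -= held;
                ++batches;

                /* Return the buffers that exceed the remaining descriptors. */
                if (held > remaining) {
                        *pool_free += held - remaining;
                        held        = remaining;
                }
        }
        return batches;
}
```

With 5 descriptors and 2 free buffers, for example, this model needs 3 batches and ends with both buffers back in the pool.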

Read operation with zero-copy data transfer

(Message sequence chart msc_inline_mscgraph_7: read operation with zero-copy data transfer)

  • The client sends a read FOP to the server. The read FOP contains the network buffer descriptor list and the indexvec list instead of the actual data.
  • To process the read FOP, the request handler creates and initializes a read FOM and puts it into the run queue for execution.
  • The state transition function goes through the generic phases and the extended phases m0_io_fom_readv_phases defined for the read FOM:
    • Gets as many buffers as it can from the buffer_pool to transfer data for all descriptors. If the buffer_pool has too few buffers to process all descriptors, the FOM proceeds batch by batch. At least one buffer is needed to start a bulk transfer. If no buffer is available, the Bulk I/O Service waits until the buffer_pool becomes non-empty.
    • Initiates the read of data from the STOB for all indexvecs and waits for the STOB I/O to complete.
    • STOB I/O completes.
    • Initiates zero-copy using rpc_bulk on the acquired buffers and waits for zero-copy to complete for all descriptors on which it was initiated.
    • Zero-copy completes.
    • Returns buffers to the buffer_pool if it holds more buffers than there are remaining descriptors.
  • Enqueues the response in fo_rep_fop for the request handler to send back to the client.

Based on the steps involved in these operations, an enumeration called m0_io_fom_cob_rw_phases will be defined. It extends the standard FOM phases (enum m0_fom_standard_phase) with new phases to handle the state machine that sets up and executes read and write operations involving bulk I/O.
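
One way to picture such an extension is sketched below. It is a stand-alone illustration, not the Motr definition: every SKETCH_ name is hypothetical, the standard phases are reduced to a stub, and SKETCH_IO_BUFFER_RELEASE is an assumed name for the step that returns buffers. The two arrays encode the success-path order of the write and read sequences described above; the buffer-wait phase is a side state entered only when no buffer is available, and the batching loop is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for enum m0_fom_standard_phase; only the end marker matters here. */
enum sketch_standard_phase {
        SKETCH_FOPH_INIT,
        SKETCH_FOPH_FINISH,
        SKETCH_FOPH_NR /* number of standard phases */
};

/* I/O phases start above the standard range so the two never collide. */
enum sketch_io_phase {
        SKETCH_IO_BUFFER_ACQUIRE = SKETCH_FOPH_NR + 1,
        SKETCH_IO_BUFFER_WAIT,
        SKETCH_IO_STOB_INIT,
        SKETCH_IO_STOB_WAIT,
        SKETCH_IO_ZERO_COPY_INIT,
        SKETCH_IO_ZERO_COPY_WAIT,
        SKETCH_IO_BUFFER_RELEASE /* hypothetical name for the release step */
};

/* Write: network transfer first, then STOB write. */
static const int sketch_write_order[] = {
        SKETCH_IO_BUFFER_ACQUIRE,
        SKETCH_IO_ZERO_COPY_INIT, SKETCH_IO_ZERO_COPY_WAIT,
        SKETCH_IO_STOB_INIT,      SKETCH_IO_STOB_WAIT,
        SKETCH_IO_BUFFER_RELEASE
};

/* Read: STOB read first, then network transfer. */
static const int sketch_read_order[] = {
        SKETCH_IO_BUFFER_ACQUIRE,
        SKETCH_IO_STOB_INIT,      SKETCH_IO_STOB_WAIT,
        SKETCH_IO_ZERO_COPY_INIT, SKETCH_IO_ZERO_COPY_WAIT,
        SKETCH_IO_BUFFER_RELEASE
};

enum {
        SKETCH_ORDER_NR = sizeof sketch_write_order /
                          sizeof sketch_write_order[0]
};

/*
 * Returns the phase that follows `phase` on the success path, or -1 when
 * `phase` is the last I/O phase or is not part of the sequence.
 */
static int sketch_next_phase(bool is_write, int phase)
{
        const int *order = is_write ? sketch_write_order : sketch_read_order;
        int        i;

        assert(sizeof sketch_write_order == sizeof sketch_read_order);
        for (i = 0; i + 1 < SKETCH_ORDER_NR; ++i)
                if (order[i] == phase)
                        return order[i + 1];
        return -1;
}
```

Starting the I/O phases above the end marker of the standard enumeration lets a single phase field in the FOM hold either kind of phase without ambiguity.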


State Transition Diagrams

State Diagram For Write FOM :

(Figure dot_inline_dotgraph_16.png: state diagram for write FOM)

Bulk I/O Service FOMs are placed in the wait queue in every state that has to wait for a task to complete.

State Diagram For Read FOM :

(Figure dot_inline_dotgraph_17.png: state diagram for read FOM)

Bulk I/O Service FOMs are placed in the wait queue in every state that has to wait for a task to complete.

Buffers Management

  • Buffer Initialization and De-allocation

The I/O service maintains an m0_buf_pool instance in the data structure m0_reqh_service. The buffer pool m0_reqh_service::m0_buf_pool is initialized in the Bulk I/O Service start operation m0_io_service_start(). The Bulk I/O Service uses m0_buf_pool_init() to allocate and register the specified number of network buffers of the specified size.

The Bulk I/O Service needs the following parameters from the configuration database to initialize the buffer pool:

    • IO_BULK_BUFFER_POOL_SIZE : Number of network buffers in the buffer pool.
    • IO_BULK_BUFFER_SIZE : Size of each network buffer.
    • IO_BULK_BUFFER_NUM_SEGMENTS : Number of segments in each buffer.

Buffer pool de-allocation takes place in the service stop operation m0_io_service_stop(). The I/O service uses m0_buf_pool_fini() to de-register and de-allocate the network buffers.

The buffer pool for bulk data transfer is private to the Bulk I/O service and is shared by all FOM instances executed by the service.
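
The way the three configuration parameters shape the pool can be sketched as follows. This is a simplified stand-in, not m0_buf_pool_init() or m0_buf_pool_fini(): all sketch_ names are hypothetical, registration with the network transport is omitted, and the sketch assumes that the buffer size is divided evenly among the segments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One network buffer: a data area divided into equally sized segments. */
struct sketch_buf {
        void  *sb_data;
        size_t sb_size;    /* cf. IO_BULK_BUFFER_SIZE */
        size_t sb_nr_segs; /* cf. IO_BULK_BUFFER_NUM_SEGMENTS */
};

struct sketch_pool {
        struct sketch_buf *sp_bufs;
        size_t             sp_nr; /* cf. IO_BULK_BUFFER_POOL_SIZE */
};

/* Frees every buffer; the real service would also de-register them. */
static void sketch_pool_fini(struct sketch_pool *pool)
{
        size_t i;

        assert(pool != NULL);
        for (i = 0; i < pool->sp_nr; ++i)
                free(pool->sp_bufs[i].sb_data);
        free(pool->sp_bufs);
        pool->sp_bufs = NULL;
        pool->sp_nr   = 0;
}

/* Allocates `nr` buffers; returns 0 on success, -1 on bad input or OOM. */
static int sketch_pool_init(struct sketch_pool *pool, size_t nr,
                            size_t buf_size, size_t nr_segs)
{
        assert(pool != NULL);
        pool->sp_bufs = NULL;
        pool->sp_nr   = 0;
        if (nr == 0 || buf_size == 0 || nr_segs == 0 ||
            buf_size % nr_segs != 0)
                return -1;
        pool->sp_bufs = calloc(nr, sizeof pool->sp_bufs[0]);
        if (pool->sp_bufs == NULL)
                return -1;
        while (pool->sp_nr < nr) {
                struct sketch_buf *buf = &pool->sp_bufs[pool->sp_nr];

                buf->sb_data = malloc(buf_size);
                if (buf->sb_data == NULL) {
                        sketch_pool_fini(pool); /* undo partial allocation */
                        return -1;
                }
                buf->sb_size    = buf_size;
                buf->sb_nr_segs = nr_segs;
                ++pool->sp_nr;
        }
        return 0;
}

/* Size of one segment under the even-split assumption of this sketch. */
static size_t sketch_seg_size(const struct sketch_buf *buf)
{
        return buf->sb_size / buf->sb_nr_segs;
}
```

Allocating everything once at service start keeps allocation and registration costs out of the per-request path, which is the point of a pre-allocated pool.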

  • Buffer Acquire

The Bulk I/O Service acquires a network buffer by calling the buffer_pool interface m0_buf_pool_get(). If a buffer is available in the buffer_pool, this function returns a network buffer. If the buffer_pool is empty, the function returns NULL, and the FOM then needs to wait for the _notEmpty signal from the buffer_pool.

The Bulk I/O Service must hold the lock on the buffer_pool instance while it requests a network buffer, and must release the lock after it gets the buffer.

  • Buffer Release

The Bulk I/O Service releases a network buffer by calling the buffer_pool interface m0_buf_pool_put(), which returns the network buffer to the buffer_pool.

The Bulk I/O Service must hold the lock on the buffer_pool instance while it returns a network buffer, and must release the lock after the buffer has been returned.
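
The acquire and release rules can be sketched together. The code below is a minimal model, not the Motr buffer pool: the sketch_ names, the fixed capacity, and the callback field are hypothetical, a POSIX mutex stands in for the pool lock, and the not-empty signal is modelled as a callback fired when a buffer is returned to an empty pool.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

enum { SKETCH_POOL_CAP = 8 }; /* arbitrary capacity for the sketch */

struct sketch_pool {
        pthread_mutex_t sp_lock;
        void           *sp_free[SKETCH_POOL_CAP]; /* stack of free buffers */
        size_t          sp_nr_free;
        /* Called when the pool goes from empty to non-empty; may be NULL. */
        void          (*sp_not_empty)(struct sketch_pool *pool);
};

/* Example callback: counts notifications instead of waking a FOM. */
static int sketch_wakeups;

static void sketch_count_wakeup(struct sketch_pool *pool)
{
        (void)pool;
        ++sketch_wakeups;
}

static void sketch_pool_init(struct sketch_pool *pool,
                             void (*not_empty)(struct sketch_pool *pool))
{
        int rc = pthread_mutex_init(&pool->sp_lock, NULL);

        assert(rc == 0);
        (void)rc;
        pool->sp_nr_free   = 0;
        pool->sp_not_empty = not_empty;
}

/* Pool primitives: the caller must hold sp_lock. */
static void *sketch_pool_get(struct sketch_pool *pool)
{
        return pool->sp_nr_free > 0 ? pool->sp_free[--pool->sp_nr_free] : NULL;
}

static bool sketch_pool_put(struct sketch_pool *pool, void *buf)
{
        assert(pool->sp_nr_free < (size_t)SKETCH_POOL_CAP);
        pool->sp_free[pool->sp_nr_free++] = buf;
        return pool->sp_nr_free == 1; /* true if the pool was empty before */
}

/* Service side: hold the lock only around the pool call. */
static void *sketch_buffer_acquire(struct sketch_pool *pool)
{
        void *buf;

        pthread_mutex_lock(&pool->sp_lock);
        buf = sketch_pool_get(pool);
        pthread_mutex_unlock(&pool->sp_lock);
        return buf; /* NULL: the FOM must wait for the not-empty signal */
}

static void sketch_buffer_release(struct sketch_pool *pool, void *buf)
{
        bool was_empty;

        pthread_mutex_lock(&pool->sp_lock);
        was_empty = sketch_pool_put(pool, buf);
        pthread_mutex_unlock(&pool->sp_lock);
        if (was_empty && pool->sp_not_empty != NULL)
                pool->sp_not_empty(pool); /* wake a waiting FOM */
}
```

Returning NULL instead of blocking inside the get call is what lets the FOM park itself in the wait queue and free the handler thread, in line with the non-blocking requirements.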

  • Buffer Pool Expansion
    Todo:
    If the buffer_pool reaches a low threshold, the Bulk I/O Service may expand the pool size. This can be done later to minimize the waiting time for a network buffer.

Service Registration

  • Service Type Declaration

The Bulk I/O Service defines its service type as follows:

    struct m0_reqh_service_type m0_ios_type = {
            .rst_name     = "M0_CST_IOS",
            .rst_ops      = &ios_type_ops,
            .rst_level    = M0_RS_LEVEL_NORMAL,
            .rst_typecode = M0_CST_IOS,
    };

This definition also assigns the service name and the service type operations for the Bulk I/O Service.

  • Service Type Registration

The Bulk I/O Service registers its service type with the request handler using the interface m0_reqh_service_type_register(). This function adds the service type to the request handler's global list of service types. The service type operation m0_ioservice_alloc_and_init() performs this registration.
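
The idea of a global service type list can be sketched with a singly linked list keyed by name. This is a hypothetical model for illustration only: the sketch_ types and functions are not the request handler API, and rejecting duplicate names is an assumption of the sketch rather than documented behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sketch_service_type {
        const char                 *st_name;
        struct sketch_service_type *st_next; /* link in the global list */
};

static struct sketch_service_type *sketch_types; /* head of the global list */

/* Looks a service type up by name; returns NULL if it is not registered. */
static struct sketch_service_type *sketch_type_find(const char *name)
{
        struct sketch_service_type *t;

        for (t = sketch_types; t != NULL; t = t->st_next)
                if (strcmp(t->st_name, name) == 0)
                        return t;
        return NULL;
}

/* Adds `type` to the list; returns -1 if the name is already registered. */
static int sketch_type_register(struct sketch_service_type *type)
{
        assert(type != NULL && type->st_name != NULL);
        if (sketch_type_find(type->st_name) != NULL)
                return -1;
        type->st_next = sketch_types;
        sketch_types  = type;
        return 0;
}

/* Removes `type` from the list; does nothing if it is not registered. */
static void sketch_type_unregister(struct sketch_service_type *type)
{
        struct sketch_service_type **link;

        for (link = &sketch_types; *link != NULL; link = &(*link)->st_next)
                if (*link == type) {
                        *link         = type->st_next;
                        type->st_next = NULL;
                        return;
                }
}
```

Once a type is on the list, a request handler can look it up by name when a service of that type has to be started.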

Threading and Concurrency Model

  • resources
    The service uses pre-allocated and pre-registered network buffers. These buffers are not released until zero-copy completes and the data has been transferred between the network buffers and the STOB. Because these buffers are pre-allocated and pre-registered with the transport layer, they must be protected by a lock so that no two users can use the same buffer at the same time.

NUMA optimizations

Dependencies

  • r.reqh : Request handler to execute Bulk I/O Service FOM
  • r.bufferpool : Network buffers for zero-copy
  • r.fop : To send bulk I/O operation request to server
  • r.net.rdma : Zero-copy data mechanism at network layer
  • r.stob.read-write : STOB I/O
  • r.rpc_bulk : For using zero-copy mechanism
  • r.configuration.caching : Configuration data being stored in node's memory.

Conformance

  • i.bulkserver.async : The service implements the state transition interface so that the Bulk I/O Service runs asynchronously.

Unit Tests

For isolated unit tests, each function implemented as part of the Bulk I/O Service needs to be tested separately, without communicating with other modules. These tests are not required to use the other modules that communicate with the Bulk I/O Service modules.

  • Test 01 : Call function m0_io_fom_cob_rw_create()
    Input : Read FOP (in-memory data structure m0_fop)
    Expected Output : Creates a FOM of the corresponding FOP type.
  • Test 02 : Call function m0_io_fom_cob_rw_create()
    Input : Write FOP (in-memory data structure m0_fop)
    Expected Output : Creates a FOM of the corresponding FOP type.
  • Test 03 : Call function m0_io_fom_cob_rw_init()
    Input : Read FOP (in-memory data structure m0_fop)
    Expected Output : Initializes the FOM with the corresponding operation vectors and other pointers.
  • Test 04 : Call function m0_io_fom_cob_rw_init()
    Input : Write FOP (in-memory data structure m0_fop)
    Expected Output : Initializes the FOM with the corresponding operation vectors and other pointers.
  • Test 05 : Call m0_io_fom_cob_rw_tick() with buffer pool size 1
    Input : Read FOM with current phase M0_FOPH_IO_FOM_BUFFER_ACQUIRE
    Expected Output : Gets a network buffer and sets its pointer in the FOM; the phase changes to M0_FOPH_IO_STOB_INIT and the return value is M0_FSO_AGAIN.
  • Test 06 : Call m0_io_fom_cob_rw_tick() with buffer pool size 0 (empty buffer_pool)
    Input : Read FOM with current phase M0_FOPH_IO_FOM_BUFFER_ACQUIRE
    Expected Output : Does not get a network buffer and sets a NULL pointer in the FOM; the phase changes to M0_FOPH_IO_FOM_BUFFER_WAIT and the return value is M0_FSO_WAIT.
  • Test 07 : Call m0_io_fom_cob_rw_tick() with buffer pool size 0 (empty buffer_pool)
    Input : Read FOM with current phase M0_FOPH_IO_FOM_BUFFER_WAIT
    Expected Output : Does not get a network buffer and sets a NULL pointer in the FOM; the phase does not change and the return value is M0_FSO_WAIT.
  • Test 08 : Call m0_io_fom_cob_rw_tick()
    Input : Read FOM with current phase M0_FOPH_IO_STOB_INIT
    Expected Output : Initiates the STOB read; the phase changes to M0_FOPH_IO_STOB_WAIT and the return value is M0_FSO_WAIT.
  • Test 09 : Call m0_io_fom_cob_rw_tick()
    Input : Read FOM with current phase M0_FOPH_IO_ZERO_COPY_INIT
    Expected Output : Initiates zero-copy; the phase changes to M0_FOPH_IO_ZERO_COPY_WAIT and the return value is M0_FSO_WAIT.
  • Test 10 : Call m0_io_fom_cob_rw_tick() with buffer pool size 1
    Input : Write FOM with current phase M0_FOPH_IO_FOM_BUFFER_ACQUIRE
    Expected Output : Gets a network buffer and sets its pointer in the FOM; the phase changes to M0_FOPH_IO_ZERO_COPY_INIT and the return value is M0_FSO_AGAIN.
  • Test 11 : Call function m0_io_fom_cob_rw_fini()
    Input : Read FOM
    Expected Output : De-allocates the FOM.
  • Test 12 : Call m0_io_fom_cob_rw_tick()
    Input : Read FOM with an invalid STOB id and current phase M0_FOPH_IO_STOB_INIT
    Expected Output : Returns an error.
  • Test 13 : Call m0_io_fom_cob_rw_tick()
    Input : Read FOM with current phase M0_FOPH_IO_ZERO_COPY_INIT and a wrong network buffer descriptor
    Expected Output : Returns an error.
  • Test 14 : Call m0_io_fom_cob_rw_tick()
    Input : Read FOM with current phase M0_FOPH_IO_STOB_WAIT and the STOB I/O result code m0_fom::m0_stob_io::si_rc set to an I/O error
    Expected Output : Returns the error M0_FOS_FAILURE and sets the I/O error in the reply FOP.
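
The buffer acquisition cases (Tests 05, 06, 07 and 10) can be sketched against a stand-in tick function. Nothing below is Motr code: the sketch_ and SKETCH_ names are hypothetical, only the acquire and wait phases are modelled, and the pool is reduced to a pointer that is NULL when the pool is empty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_phase {
        SKETCH_BUFFER_ACQUIRE,
        SKETCH_BUFFER_WAIT,
        SKETCH_STOB_INIT,
        SKETCH_ZERO_COPY_INIT
};

enum sketch_outcome { SKETCH_FSO_AGAIN, SKETCH_FSO_WAIT };

struct sketch_fom {
        bool              sf_is_write;
        enum sketch_phase sf_phase;
        void             *sf_buf; /* acquired network buffer or NULL */
};

/*
 * Stand-in tick covering only the buffer acquisition phases. `pool_buf` is
 * what the pool would hand out: a buffer, or NULL for an empty pool.
 */
static enum sketch_outcome sketch_tick(struct sketch_fom *fom, void *pool_buf)
{
        assert(fom->sf_phase == SKETCH_BUFFER_ACQUIRE ||
               fom->sf_phase == SKETCH_BUFFER_WAIT);
        fom->sf_buf = pool_buf;
        if (pool_buf == NULL) {
                fom->sf_phase = SKETCH_BUFFER_WAIT;
                return SKETCH_FSO_WAIT;
        }
        fom->sf_phase = fom->sf_is_write ? SKETCH_ZERO_COPY_INIT
                                         : SKETCH_STOB_INIT;
        return SKETCH_FSO_AGAIN;
}

/* Runs Tests 05, 06, 07 and 10 on the stand-in; returns tests passed. */
static int sketch_run_buffer_tests(void)
{
        int                 netbuf = 0; /* dummy object standing in for a buffer */
        struct sketch_fom   fom;
        enum sketch_outcome rc;
        int                 passed = 0;

        /* Test 05: read FOM, one buffer available. */
        fom = (struct sketch_fom){ false, SKETCH_BUFFER_ACQUIRE, NULL };
        rc  = sketch_tick(&fom, &netbuf);
        assert(rc == SKETCH_FSO_AGAIN && fom.sf_phase == SKETCH_STOB_INIT);
        assert(fom.sf_buf == &netbuf);
        ++passed;

        /* Test 06: read FOM, empty pool. */
        fom = (struct sketch_fom){ false, SKETCH_BUFFER_ACQUIRE, NULL };
        rc  = sketch_tick(&fom, NULL);
        assert(rc == SKETCH_FSO_WAIT && fom.sf_phase == SKETCH_BUFFER_WAIT);
        assert(fom.sf_buf == NULL);
        ++passed;

        /* Test 07: already waiting, pool still empty: phase is unchanged. */
        rc = sketch_tick(&fom, NULL);
        assert(rc == SKETCH_FSO_WAIT && fom.sf_phase == SKETCH_BUFFER_WAIT);
        ++passed;

        /* Test 10: write FOM, one buffer available. */
        fom = (struct sketch_fom){ true, SKETCH_BUFFER_ACQUIRE, NULL };
        rc  = sketch_tick(&fom, &netbuf);
        assert(rc == SKETCH_FSO_AGAIN && fom.sf_phase == SKETCH_ZERO_COPY_INIT);
        ++passed;

        return passed;
}
```

Passing the pool state in as a parameter keeps each case isolated from the real buffer pool, which matches the isolation goal stated for the unit tests.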

Integration Tests

All the tests mentioned in the Unit Tests section will be implemented with an actual bulk I/O client.

System Tests

All the tests mentioned in the Unit Tests section will be implemented with actual I/O (read, write) system calls.


Analysis

  • Acquiring network buffers for zero-copy needs to be implemented as an asynchronous operation. Otherwise, when buffers are not available, every I/O FOM that tries to acquire this resource would block, tying up many request handler threads.
  • Use of pre-allocated and pre-registered buffers could decrease I/O throughput, since all I/O FOPs need this resource to process an operation.
  • On the other hand, the use of zero-copy improves I/O performance.

References

References to other documents are essential.

For documentation links, please refer to this file: doc/motr-design-doc-list.rst

  • Fop State Machines for IO FOPs
  • FOPFOM Programming Guide
  • High Level Design - FOP State Machine
  • High level design of rpc layer core