Overview
This document contains the detailed level design of the Bulk I/O Service.
Purpose
The purpose of this document is to:
- Refine higher-level designs
- Be verified by inspectors and architects
- Guide the coding phase
Definitions
Terms used in this document are defined below:
- Bulk I/O Service Motr ioservice which processes read/write FOPs.
- FOP File operation packet, a description of a file operation suitable for sending over the network or storing on a storage device. A file operation packet (FOP) identifies the file operation type and the operation parameters.
- FOM FOP state machine (FOM) is a state machine that represents the current state of the FOP's execution on a node. A FOM is associated with a particular FOP and implicitly includes this FOP as part of its state.
- zero-copy Copy between a source and destination takes place without any intermediate copies to staging areas.
- STOB Storage object (STOB) is a basic M0 data structure containing raw data.
- COB Component object (COB) is a component (stripe) of a file, referencing a single storage object and containing metadata describing the object.
- rpc_bulk Generic interface for zero-copy.
- buffer_pool Pre-allocated and pre-registered pool of buffers. The buffer pool also provides interfaces to get and put buffers. Every Bulk I/O Service initializes its own buffer_pool.
- Configuration cache Configuration data stored in the node's memory.
Requirements
- r.bulkserver.async Bulk I/O server runs asynchronously.
- r.non-blocking.few-threads Motr service should use a relatively small number of threads: a few per processor.
- r.non-blocking.easy Non-blocking infrastructure should be easy to use and non-intrusive.
- r.non-blocking.extensibility Addition of new "cross-cut" functionality (e.g., logging, reporting) potentially including blocking points and affecting multiple fop types should not require extensive changes to the data-structures for each fop type involved.
- r.non-blocking.network Network communication must not block handler threads.
- r.non-blocking.storage Storage transfers must not block handler threads.
- r.non-blocking.resources Resource acquisition and release must not block handler threads.
- r.non-blocking.other-block Other potentially blocking conditions (page faults, memory allocations, writing trace records, etc.) must never block all service threads.
Design Overview
The Bulk I/O Service will be available in the form of a state machine that processes bulk I/O requests. It uses the generic rpc_bulk interface to access the zero-copy RDMA mechanism of the transport layer and copy data from source to destination. It also uses the STOB I/O interface to complete the I/O operation.
The Bulk I/O Service implements I/O FOMs to process I/O FOPs (see Fop State Machines for IO FOPs).
The Bulk I/O Service interface m0_ioservice_fop_init() registers and initializes the I/O FOPs with it. Following are the Bulk I/O Service FOP types.
The Bulk I/O Service initializes buffer_pool during its own initialization. It gets the buffers it requires from buffer_pool and passes them to rpc_bulk for zero-copy. The buffers are returned to buffer_pool after the data has been written to the STOB.
The Bulk I/O Service is initialized by the request handler during its startup.
Logical Specification
Sequence diagram
This section describes how client and server communication happens while processing read/write FOPs. It also shows the usage of zero-copy in I/O FOP processing.
Write operation with zero-copy data transfer
- Client sends a write FOP to the server. The write FOP contains the network buffer descriptor list and the indexvec list instead of actual data.
- To process the write FOP, the request handler creates and initializes a write FOM and puts it into the run queue for execution.
- The state transition function goes through the generic phases and the extended phases m0_io_fom_cob_rw_phases defined for the I/O FOM (write FOM).
- Gets as many buffers as it can from buffer_pool to transfer data for all descriptors. If buffer_pool has too few buffers to process all descriptors, processing proceeds batch by batch. At least one buffer is needed to start the bulk transfer. If no buffer is available, the Bulk I/O Service waits until buffer_pool becomes non-empty.
- Initiates zero-copy using rpc_bulk on the acquired buffers and waits for zero-copy to complete for all descriptors on which it was initiated.
- Zero-copy completes.
- Initiates the write of data to the STOB for all indexvecs and waits for STOB I/O to complete.
- STOB I/O completes.
- Returns some of the buffers to buffer_pool if it holds more buffers than there are remaining descriptors.
- Enqueues the response in fo_rep_fop for the request handler to send back to the client.
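The write steps above can be sketched as a small state machine. This is a minimal illustration, not the Motr implementation: the WR_* phases, struct wr_fom, wr_tick() and wr_run() are hypothetical names, the buffer pool is reduced to a counter of free buffers, and every wait is assumed to complete immediately.

```c
#include <assert.h>

/* Hypothetical phases; they only mirror the write steps listed above. */
enum wr_phase {
    WR_BUFFER_ACQUIRE, WR_BUFFER_WAIT,
    WR_ZERO_COPY_INIT, WR_ZERO_COPY_WAIT,
    WR_STOB_INIT, WR_STOB_WAIT,
    WR_BUFFER_RELEASE, WR_DONE
};

enum tick_rc { TICK_AGAIN, TICK_WAIT };

struct wr_fom {
    enum wr_phase phase;
    int descs_left; /* descriptors still to be written */
    int held;       /* buffers currently held by this FOM */
    int batches;    /* zero-copy batches started so far */
};

static int min_int(int a, int b) { return a < b ? a : b; }

/* One state transition; *pool_free models the free buffers in buffer_pool. */
static enum tick_rc wr_tick(struct wr_fom *f, int *pool_free)
{
    int keep;

    switch (f->phase) {
    case WR_BUFFER_ACQUIRE:
    case WR_BUFFER_WAIT:
        assert(f->descs_left > 0);
        if (*pool_free == 0) {      /* at least one buffer is needed */
            f->phase = WR_BUFFER_WAIT;
            return TICK_WAIT;
        }
        f->held = min_int(*pool_free, f->descs_left);
        *pool_free -= f->held;
        f->phase = WR_ZERO_COPY_INIT;
        return TICK_AGAIN;
    case WR_ZERO_COPY_INIT:         /* start zero-copy for this batch */
        f->batches++;
        f->phase = WR_ZERO_COPY_WAIT;
        return TICK_WAIT;
    case WR_ZERO_COPY_WAIT:         /* zero-copy completed */
        f->phase = WR_STOB_INIT;
        return TICK_AGAIN;
    case WR_STOB_INIT:              /* start the STOB write */
        f->phase = WR_STOB_WAIT;
        return TICK_WAIT;
    case WR_STOB_WAIT:              /* STOB I/O completed */
        f->descs_left -= f->held;
        f->phase = WR_BUFFER_RELEASE;
        return TICK_AGAIN;
    case WR_BUFFER_RELEASE:         /* return buffers that are no longer needed */
        keep = min_int(f->held, f->descs_left);
        *pool_free += f->held - keep;
        f->held = keep;
        f->phase = f->descs_left > 0 ? WR_ZERO_COPY_INIT : WR_DONE;
        return TICK_AGAIN;
    case WR_DONE:
        break;
    }
    return TICK_WAIT;
}

/* Drive the FOM until it finishes or stalls on an empty pool. */
static enum wr_phase wr_run(struct wr_fom *f, int *pool_free)
{
    while (f->phase != WR_DONE) {
        wr_tick(f, pool_free);
        if (f->phase == WR_BUFFER_WAIT)
            break;
    }
    return f->phase;
}
```

With five descriptors and two free buffers, this sketch runs three batches (2, 2 and 1 descriptors) and hands both buffers back to the pool by the end.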
Read operation with zero-copy data transfer
- Client sends a read FOP to the server. The read FOP contains the network buffer descriptor list and the indexvec list instead of actual data.
- To process the read FOP, the request handler creates and initializes a read FOM and puts it into the run queue for execution.
- The state transition function goes through the generic phases and the extended phases m0_io_fom_readv_phases defined for the read FOM.
- Gets as many buffers as it can from buffer_pool to transfer data for all descriptors. If buffer_pool has too few buffers to process all descriptors, processing proceeds batch by batch. At least one buffer is needed to start the bulk transfer. If no buffer is available, the Bulk I/O Service waits until buffer_pool becomes non-empty.
- Initiates the read of data from the STOB for all indexvecs and waits for STOB I/O to complete.
- STOB I/O completes.
- Initiates zero-copy using rpc_bulk on the acquired buffers and waits for zero-copy to complete for all descriptors on which it was initiated.
- Zero-copy completes.
- Returns some of the buffers to buffer_pool if it holds more buffers than there are remaining descriptors.
- Enqueues the response in fo_rep_fop for the request handler to send back to the client.
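The read and write sequences use the same steps in a different order: a read performs STOB I/O before zero-copy, a write performs zero-copy before STOB I/O. A minimal sketch of that ordering, using hypothetical PH_* names and a hypothetical io_next_phase() helper rather than the actual Motr phases:

```c
#include <assert.h>
#include <stddef.h>

enum io_op { IO_READ, IO_WRITE };

/* Hypothetical phase names for illustration only. */
enum io_phase {
    PH_BUFFER_ACQUIRE,
    PH_STOB_INIT,
    PH_STOB_WAIT,
    PH_ZERO_COPY_INIT,
    PH_ZERO_COPY_WAIT,
    PH_BUFFER_RELEASE,
    PH_FINISH
};

#define ORDER_LEN 6

/* Read: fetch data from the STOB first, then push it to the client. */
static const enum io_phase read_order[ORDER_LEN] = {
    PH_BUFFER_ACQUIRE, PH_STOB_INIT, PH_STOB_WAIT,
    PH_ZERO_COPY_INIT, PH_ZERO_COPY_WAIT, PH_BUFFER_RELEASE
};

/* Write: pull data from the client first, then store it on the STOB. */
static const enum io_phase write_order[ORDER_LEN] = {
    PH_BUFFER_ACQUIRE, PH_ZERO_COPY_INIT, PH_ZERO_COPY_WAIT,
    PH_STOB_INIT, PH_STOB_WAIT, PH_BUFFER_RELEASE
};

/* Return the phase that follows cur, or PH_FINISH after the last one. */
static enum io_phase io_next_phase(enum io_op op, enum io_phase cur)
{
    const enum io_phase *order = op == IO_READ ? read_order : write_order;
    size_t i;

    for (i = 0; i + 1 < ORDER_LEN; i++)
        if (order[i] == cur)
            return order[i + 1];
    return PH_FINISH;
}
```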
Based on the steps involved in these operations, an enumeration called m0_io_fom_cob_rw_phases will be defined. It extends the standard FOM phases (enum m0_fom_standard_phase) with new phases to handle the state machine that sets up and executes read and write operations involving bulk I/O.
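One common way to extend a phase enumeration is to start the new values where the generic ones end. The sketch below assumes that approach; enum std_phase is a three-entry stand-in for enum m0_fom_standard_phase, and all names in it are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the generic phases (the real list is longer). */
enum std_phase {
    STD_INIT,
    STD_SUCCESS,
    STD_FAILURE,
    STD_NR              /* number of generic phases */
};

/* I/O-specific phases continue numbering where the generic ones stop. */
enum io_phase {
    IO_BUFFER_ACQUIRE = STD_NR,
    IO_BUFFER_WAIT,
    IO_STOB_INIT,
    IO_STOB_WAIT,
    IO_ZERO_COPY_INIT,
    IO_ZERO_COPY_WAIT,
    IO_BUFFER_RELEASE,
    IO_PHASE_END        /* one past the last I/O-specific phase */
};

/* A single integer phase field can then hold either kind of phase. */
static bool phase_is_io_specific(int phase)
{
    return phase >= IO_BUFFER_ACQUIRE && phase < IO_PHASE_END;
}
```

Because the two ranges never overlap, generic FOM code can dispatch on the same phase field without knowing about the I/O-specific values.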
State Transition Diagrams
State Diagram For Write FOM :
Bulk I/O Service FOMs will be placed in the wait queue in all states that need to wait for a task to complete.
State Diagram For Read FOM :
Bulk I/O Service FOMs will be placed in the wait queue in all states that need to wait for a task to complete.
Buffers Management
- Buffers Initialization & De-allocation :
The I/O service maintains an m0_buf_pool instance within the data structure m0_reqh_service. The buffer pool m0_reqh_service::m0_buf_pool will be initialized in the Bulk I/O Service start operation vector m0_io_service_start(). The Bulk I/O Service will use m0_buf_pool_init() to allocate and register the specified number of network buffers of the specified size.
Bulk I/O Service needs following parameters from configuration database to initialize buffer pool -
- IO_BULK_BUFFER_POOL_SIZE Number of network buffers in buffer pool.
- IO_BULK_BUFFER_SIZE Size of each network buffer.
- IO_BULK_BUFFER_NUM_SEGMENTS Number of segments in each buffer.
Buffer pool de-allocation takes place in service operation vector m0_io_service_stop(). I/O service will use m0_buf_pool_fini() to de-allocate & de-register the network buffers.
The buffer pool for bulk data transfer is private to the Bulk I/O service and is shared by all FOM instances executed by the service.
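The initialization and de-allocation described above can be illustrated with a small stand-alone pool. This is a sketch under assumptions, not the m0_buf_pool code: struct pool_cfg, struct sketch_pool, sketch_pool_init() and sketch_pool_fini() are hypothetical, registration with the transport layer is left out, and each buffer is assumed to be split into equal-sized segments.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical holder for the three configuration parameters. */
struct pool_cfg {
    size_t pool_size;   /* IO_BULK_BUFFER_POOL_SIZE */
    size_t buf_size;    /* IO_BULK_BUFFER_SIZE */
    size_t nr_segments; /* IO_BULK_BUFFER_NUM_SEGMENTS */
};

struct sketch_buf {
    unsigned char *data;
    size_t         seg_size;
    size_t         nr_segments;
};

struct sketch_pool {
    struct sketch_buf *bufs;
    size_t             nr_bufs;
    size_t             nr_free;
};

/* Free every buffer and reset the pool; safe on a partially built pool. */
static void sketch_pool_fini(struct sketch_pool *p)
{
    size_t i;

    for (i = 0; i < p->nr_bufs; i++)
        free(p->bufs[i].data);
    free(p->bufs);
    p->bufs = NULL;
    p->nr_bufs = 0;
    p->nr_free = 0;
}

/* Allocate cfg->pool_size buffers up front; returns 0 or a negative errno. */
static int sketch_pool_init(struct sketch_pool *p, const struct pool_cfg *cfg)
{
    size_t i;

    if (cfg->pool_size == 0 || cfg->buf_size == 0 ||
        cfg->nr_segments == 0 || cfg->buf_size % cfg->nr_segments != 0)
        return -EINVAL;

    p->bufs = calloc(cfg->pool_size, sizeof p->bufs[0]);
    if (p->bufs == NULL)
        return -ENOMEM;
    p->nr_bufs = cfg->pool_size;
    p->nr_free = 0;

    for (i = 0; i < cfg->pool_size; i++) {
        p->bufs[i].data = malloc(cfg->buf_size);
        if (p->bufs[i].data == NULL) {
            sketch_pool_fini(p);    /* undo the partial allocation */
            return -ENOMEM;
        }
        p->bufs[i].seg_size = cfg->buf_size / cfg->nr_segments;
        p->bufs[i].nr_segments = cfg->nr_segments;
        p->nr_free++;
    }
    return 0;
}
```

Allocating everything at service start keeps memory allocation out of the I/O path, which is in line with the non-blocking requirements listed earlier.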
The Bulk I/O Service acquires a network buffer by calling the buffer_pool interface m0_buf_pool_get(). If a buffer is available in buffer_pool, this function returns the network buffer. If buffer_pool is empty, the function returns NULL, and the FOM then needs to wait for the _notEmpty signal from buffer_pool.
The Bulk I/O Service needs to hold the lock on the buffer_pool instance while it requests a network buffer, and releases the lock after it gets the buffer.
The Bulk I/O Service releases a network buffer by calling the buffer_pool interface m0_buf_pool_put(), which returns the network buffer to buffer_pool.
The Bulk I/O Service needs to hold the lock on the buffer_pool instance while it returns a network buffer, and releases the lock after the buffer has been returned.
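The locking discipline around get and put can be sketched as follows. The names here (struct sketch_pool, sketch_pool_get(), sketch_pool_put(), fom_acquire(), fom_release()) are hypothetical and the actual m0_buf_pool interfaces may differ; the not-empty notification is modelled as a plain callback that fires when a buffer is returned to an empty pool.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define POOL_CAP 8

struct sketch_pool {
    pthread_mutex_t lock;
    void           *free_bufs[POOL_CAP];
    size_t          nr_free;
    void          (*not_empty)(struct sketch_pool *pool); /* may be NULL */
    int             signals; /* how many times not_empty fired */
};

static void sketch_pool_setup(struct sketch_pool *pool,
                              void (*cb)(struct sketch_pool *))
{
    pthread_mutex_init(&pool->lock, NULL);
    pool->nr_free = 0;
    pool->not_empty = cb;
    pool->signals = 0;
}

/* Caller must hold pool->lock. Returns NULL when the pool is empty. */
static void *sketch_pool_get(struct sketch_pool *pool)
{
    return pool->nr_free > 0 ? pool->free_bufs[--pool->nr_free] : NULL;
}

/* Caller must hold pool->lock. Signals when the pool stops being empty. */
static void sketch_pool_put(struct sketch_pool *pool, void *buf)
{
    assert(pool->nr_free < POOL_CAP);
    pool->free_bufs[pool->nr_free++] = buf;
    if (pool->nr_free == 1 && pool->not_empty != NULL)
        pool->not_empty(pool);
}

/* Service side: take the lock, request a buffer, drop the lock. */
static void *fom_acquire(struct sketch_pool *pool)
{
    void *buf;

    pthread_mutex_lock(&pool->lock);
    buf = sketch_pool_get(pool);
    pthread_mutex_unlock(&pool->lock);
    return buf; /* NULL means: wait for the not-empty signal */
}

/* Service side: take the lock, return the buffer, drop the lock. */
static void fom_release(struct sketch_pool *pool, void *buf)
{
    pthread_mutex_lock(&pool->lock);
    sketch_pool_put(pool, buf);
    pthread_mutex_unlock(&pool->lock);
}

/* Example callback: just count the notifications. */
static void count_signal(struct sketch_pool *pool)
{
    pool->signals++;
}
```

A FOM that receives NULL from fom_acquire() would move to a wait phase and retry once the callback fires, instead of blocking its handler thread.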
- Buffer Pool Expansion
- Todo:
- If buffer_pool reaches a low threshold, the Bulk I/O Service may expand the pool size. This can be done later to minimize the waiting time for network buffers.
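Since pool expansion is still a to-do item, the following is only one possible shape for the low-threshold check. struct pool_stats, pool_growth(), the upper limit max_bufs and the growth step are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical snapshot of the pool state. */
struct pool_stats {
    size_t nr_bufs;  /* buffers currently owned by the pool */
    size_t nr_free;  /* buffers not handed out */
    size_t max_bufs; /* assumed upper limit on pool size */
};

/*
 * Return how many buffers to add: 0 while the number of free buffers is
 * above low_threshold or the pool is already at its limit, otherwise up to
 * step buffers without exceeding max_bufs.
 */
static size_t pool_growth(const struct pool_stats *s,
                          size_t low_threshold, size_t step)
{
    size_t room;

    if (s->nr_free > low_threshold || s->nr_bufs >= s->max_bufs)
        return 0;
    room = s->max_bufs - s->nr_bufs;
    return step < room ? step : room;
}
```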
Service Registration
Bulk I/O Service defines service type as follows -
struct m0_reqh_service_type m0_ios_type = {
        .rst_name     = "M0_CST_IOS",
        .rst_ops      = &ios_type_ops,
        .rst_level    = M0_RS_LEVEL_NORMAL,
        .rst_typecode = M0_CST_IOS,
};
It also assigns service name and service type operations for Bulk I/O Service.
- Service Type Registration
The Bulk I/O Service registers its service type with the request handler using the interface m0_reqh_service_type_register(). This function adds the service type to the request handler's global service type list. The service type operation m0_ioservice_alloc_and_init() will do this registration.
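Registering a service type with a global list can be pictured with a small linked-list registry. This is a generic sketch, not the request handler code: struct svc_type, svc_type_register() and svc_type_find() are hypothetical, and rejecting duplicate names is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, stripped-down service type descriptor. */
struct svc_type {
    const char      *name;
    struct svc_type *next; /* link in the global list */
};

static struct svc_type *svc_types; /* head of the global service type list */

/* Look a service type up by name; returns NULL if it is not registered. */
static struct svc_type *svc_type_find(const char *name)
{
    struct svc_type *t;

    for (t = svc_types; t != NULL; t = t->next)
        if (strcmp(t->name, name) == 0)
            return t;
    return NULL;
}

/* Add a service type to the global list; duplicate names are rejected. */
static int svc_type_register(struct svc_type *t)
{
    if (svc_type_find(t->name) != NULL)
        return -EEXIST;
    t->next = svc_types;
    svc_types = t;
    return 0;
}
```

Once registered, the request handler can look the type up by name when it is asked to start that service.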
Threading and Concurrency Model
- resources
It uses pre-allocated and pre-registered network buffers. These buffers will not be released until zero-copy completes and the data in the network buffers has been transferred to or from the STOB. Since these buffers are pre-allocated and pre-registered with the transport layer, they should be protected by a lock so that no two users can use the same buffer at the same time.
NUMA optimizations
Dependencies
- r.reqh : Request handler to execute Bulk I/O Service FOM
- r.bufferpool : Network buffers for zero-copy
- r.fop : To send bulk I/O operation request to server
- r.net.rdma : Zero-copy data mechanism at network layer
- r.stob.read-write : STOB I/O
- r.rpc_bulk : For using zero-copy mechanism
- r.configuration.caching : Configuration data being stored in node's memory.
Conformance
- i.bulkserver.async It implements the state transition interface so that the Bulk I/O Service runs asynchronously.
Unit Tests
For isolated unit tests, each function implemented as part of the Bulk I/O Service needs to be tested separately, without communicating with other modules. These tests are not required to use the other modules that communicate with the Bulk I/O Service modules.
- Test 01 : Call function m0_io_fom_cob_rw_create()
Input : Read FOP (in-memory data structure m0_fop)
Expected Output : Create FOM of corresponding FOP type.
- Test 02 : Call function m0_io_fom_cob_rw_create()
Input : Write FOP (in-memory data structure m0_fop)
Expected Output : Create FOM of corresponding FOP type.
- Test 03 : Call function m0_io_fom_cob_rw_init()
Input : Read FOP (in-memory data structure m0_fop)
Expected Output : Initiates FOM with corresponding operation vectors and other pointers.
- Test 04 : Call function m0_io_fom_cob_rw_init()
Input : Write FOP (in-memory data structure m0_fop)
Expected Output : Initiates FOM with corresponding operation vectors and other pointers.
- Test 05 : Call m0_io_fom_cob_rw_tick() with buffer pool size 1
Input : Read FOM with current phase M0_FOPH_IO_FOM_BUFFER_ACQUIRE
Expected Output : Gets network buffer and pointer set into FOM with phase changed to M0_FOPH_IO_STOB_INIT and return value M0_FSO_AGAIN.
- Test 06 : Call m0_io_fom_cob_rw_tick() with buffer pool size 0 (empty buffer_pool)
Input : Read FOM with current phase M0_FOPH_IO_FOM_BUFFER_ACQUIRE
Expected Output : Should not get a network buffer; NULL pointer set into FOM with phase changed to M0_FOPH_IO_FOM_BUFFER_WAIT and return value M0_FSO_WAIT.
- Test 07 : Call m0_io_fom_cob_rw_tick() with buffer pool size 0 (empty buffer_pool)
Input : Read FOM with current phase M0_FOPH_IO_FOM_BUFFER_WAIT
Expected Output : Should not get a network buffer; NULL pointer set into FOM with phase not changed and return value M0_FSO_WAIT.
- Test 08 : Call m0_io_fom_cob_rw_tick()
Input : Read FOM with current phase M0_FOPH_IO_STOB_INIT
Expected Output : Initiates STOB read with phase changed to M0_FOPH_IO_STOB_WAIT and return value M0_FSO_WAIT.
- Test 09 : Call m0_io_fom_cob_rw_tick()
Input : Read FOM with current phase M0_FOPH_IO_ZERO_COPY_INIT
Expected Output : Initiates zero-copy with phase changed to M0_FOPH_IO_ZERO_COPY_WAIT and return value M0_FSO_WAIT.
- Test 10 : Call m0_io_fom_cob_rw_tick() with buffer pool size 1
Input : Write FOM with current phase M0_FOPH_IO_FOM_BUFFER_ACQUIRE
Expected Output : Gets network buffer and pointer set into FOM with phase changed to M0_FOPH_IO_ZERO_COPY_INIT and return value M0_FSO_AGAIN.
- Test 11 : Call function m0_io_fom_cob_rw_fini()
Input : Read FOM
Expected Output : Should de-allocate FOM.
- Test 12 : Call m0_io_fom_cob_rw_tick()
Input : Read FOM with invalid STOB id and current phase M0_FOPH_IO_STOB_INIT.
Expected Output : Should return error.
- Test 13 : Call m0_io_fom_cob_rw_tick()
Input : Read FOM with current phase M0_FOPH_IO_ZERO_COPY_INIT and wrong network buffer descriptor.
Expected Output : Should return error.
- Test 14 : Call m0_io_fom_cob_rw_tick()
Input : Read FOM with current phase M0_FOPH_IO_STOB_WAIT and the result code of STOB I/O, m0_fom::m0_stob_io::si_rc, set to I/O error.
Expected Output : Should return error M0_FOS_FAILURE and I/O error set in reply FOP.
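The error-path tests above check that a failed step moves the FOM to a failure state and records the error in the reply. A minimal sketch of the Test 14 case, with hypothetical names (RD_* phases, struct rd_fom, rd_stob_wait_done()) standing in for the real FOM and reply FOP:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical subset of the read FOM phases. */
enum rd_phase { RD_STOB_WAIT, RD_ZERO_COPY_INIT, RD_FAILURE };

struct rd_reply {
    int rc; /* result reported back to the client */
};

struct rd_fom {
    enum rd_phase   phase;
    int             stob_rc; /* result code of the finished STOB I/O */
    struct rd_reply reply;
};

/* Handle STOB I/O completion: continue on success, fail and report on error. */
static void rd_stob_wait_done(struct rd_fom *f)
{
    assert(f->phase == RD_STOB_WAIT);
    if (f->stob_rc != 0) {
        f->reply.rc = f->stob_rc;  /* propagate the I/O error to the reply */
        f->phase = RD_FAILURE;
    } else {
        f->phase = RD_ZERO_COPY_INIT;
    }
}
```

A unit test in this style sets the stored result code to an I/O error, runs a single transition, and then inspects both the phase and the reply.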
Integration Tests
All the tests mentioned in Unit test section will be implemented with actual bulk I/O client.
System Tests
All the tests mentioned in unit test section will be implemented with actual I/O (read, write) system calls.
Analysis
- Acquiring network buffers for zero-copy needs to be implemented as an asynchronous operation. Otherwise, every I/O FOM that tries to acquire this resource while no buffers are available blocks a request handler thread, so a large number of request handler threads would be needed.
- Use of pre-allocated and pre-registered buffers could decrease I/O throughput, since all I/O FOPs need this resource to process their operations.
- On the other hand, the use of zero-copy improves I/O performance.
References
References to other documents are essential.
For documentation links, please refer to this file: doc/motr-design-doc-list.rst
- Fop State Machines for IO FOPs
- FOPFOM Programming Guide
- High Level Design - FOP State Machine
- High level design of rpc layer core