I/O with SNS and SNS repair.

Overview

Note
This DLD is written by Huang Hua (hua.huang@seagate.com), 2012/10/10.

This DLD describes how m0t1fs performs I/O with SNS in the normal condition, in degraded mode, and after SNS repair has completed.

A file (also known as a global object) in Motr is stored in multiple component objects spread across multiple servers. This is usually called Server Network Striping, a.k.a. SNS. A layout describes the mapping from a file to its component objects. A read request for some specific offset within a file is directed, according to the layout, to the corresponding parts of its component objects; a write request works the same way. Some files store no redundancy information, like RAID0, but in Motr the default and typical mode is to store redundancy data for files. Write requests may therefore also include updates to the redundancy data.
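
For illustration only, the sketch below shows how a file offset could map to a parity group and a data unit under an N+K striping layout with a fixed unit size. The names (sns_map, sns_map_offset) are hypothetical and do not reflect Motr's real layout interfaces.

  #include <inttypes.h>
  #include <stdio.h>

  /* Hypothetical illustration of SNS striping: consecutive data units of
   * 'unit_size' bytes are placed round-robin over the N data units of each
   * parity group.  Names are illustrative, not Motr's layout API. */
  struct sns_map {
          uint64_t group;    /* parity group index within the file      */
          uint32_t unit;     /* data unit index within the parity group */
          uint64_t unit_off; /* byte offset within that data unit       */
  };

  static void sns_map_offset(uint64_t file_off, uint64_t unit_size,
                             uint32_t N, struct sns_map *out)
  {
          uint64_t unit_nr = file_off / unit_size; /* global data unit no. */

          out->group    = unit_nr / N;
          out->unit     = (uint32_t)(unit_nr % N);
          out->unit_off = file_off % unit_size;
  }

  int main(void)
  {
          struct sns_map m;

          /* e.g. 1 MiB units, N = 4 data units per parity group */
          sns_map_offset(6ULL * 1024 * 1024 + 512, 1024 * 1024, 4, &m);
          printf("group=%" PRIu64 " unit=%" PRIu32 " unit_off=%" PRIu64 "\n",
                 m.group, m.unit, m.unit_off);
          return 0;
  }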

In case of a node or device failure, lost data can be re-constructed from redundancy information. A read request for lost data is satisfied by re-constructing the data from the surviving data and parity units. When SNS repair has completed for the failed node or device, a read or write request can be served by re-directing it to the corresponding spare unit.

Write requests to a failed node or device are handled differently, in cooperation with SNS repair and NBA (Non-Blocking Availability); this is out of the scope of this DLD.

Each client caches the failure vectors of a pool. With the failure vector information, clients know whether to re-construct data from the other data and parity units, or to read from the spare units (which contain the repaired data). The details are discussed in the logical specification below.


Definitions

Previously defined terms:

  • layout A mapping from a Motr file (global object) to its component objects. See Layouts for more details.
  • SNS Server Network Striping. See SNS for more details.

Requirements

  • R.iosnsrepair.read Read requests should be served in the normal case, during SNS repair, and after SNS repair completes.
  • R.iosnsrepair.write Write requests should be served in the normal case and after SNS repair completes.
  • R.iosnsrepair.code Code should be re-used and shared with other m0t1fs client features, especially the rmw feature.

Dependencies

The feature depends on the following features:

  • layout.
  • SNS and failure vector.

The implementation of this feature may depend on the m0t1fs read-modify-write (rmw) feature, which is under development.


Design Highlights

The m0t1fs read-modify-write (rmw) feature shares several concepts with this feature. The same code path will be used to serve both features.


Logical Specification

Component Overview

When an I/O request (read, write, or other) arrives at the client, m0t1fs first checks its cached failure vector to determine the status of the pool's nodes and devices. A read or write request spans some node(s) or device(s). If all of these are ONLINE, this is the normal case. If some node or device is FAILED, REPAIRING, or REPAIRED, the state of the pool changes. When all nodes and devices are ONLINE, the pool is ONLINE and I/O requests are handled normally. If failures have happened but their number does not exceed what the pool is configured to sustain, the pool is DEGRADED, and I/O requests are handled with the help of parity information or spare units. If more failures have happened than the pool is configured to sustain, the pool is in the DUD state and all I/O requests fail with -EIO. The pool state determines how client I/O is performed, specifically whether writes use NBA and whether reads and writes run in degraded mode. The pool state can be calculated from the failure vector.
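
As an illustration of how the pool state could be derived from the cached failure vector, here is a minimal sketch; the enum and function names are hypothetical, and the real calculation is performed by the pool machine.

  #include <stdint.h>

  /* Hypothetical sketch: derive the pool state from the number of failed
   * (non-ONLINE) devices/nodes reported by the cached failure vector.
   * 'K' is the number of failures the pool is configured to sustain.
   * Names are illustrative, not the pool machine's actual API. */
  enum pool_state {
          POOL_ONLINE,   /* all nodes and devices are ONLINE          */
          POOL_DEGRADED, /* 1..K failures: use parity or spare units  */
          POOL_DUD       /* more than K failures: I/O fails with -EIO */
  };

  static enum pool_state pool_state_calc(uint32_t nr_failures, uint32_t K)
  {
          if (nr_failures == 0)
                  return POOL_ONLINE;
          return nr_failures <= K ? POOL_DEGRADED : POOL_DUD;
  }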

In the other cases, special action is taken to serve the request. The following table illustrates the actions:

                    Table 1   I/O request handling

  Request  Device state        Action
  -------  ------------------  ----------------------------------------------
  read     ONLINE              read from the target device
           OFFLINE             same as FAILED
           FAILED, REPAIRING   read from the other data unit(s) and parity
                               unit(s) and re-construct the data; if NBA**
                               applies, use the new layout for reading if
                               necessary (see degraded read, (1) below)
           REPAIRED            read from the repaired spare unit, or use the
                               new layout if NBA** applies
  -------  ------------------  ----------------------------------------------
  write    ONLINE              write to the target device
           OFFLINE             same as FAILED
           FAILED              NBA** determines whether the new or the old
                               layout is used; with the old layout this is a
                               degraded write (see (2) below)
           REPAIRING           concurrent++ write I/O and SNS repair is out
                               of the scope of this DLD and is not supported
                               currently; -ENOTSUP is returned
           REPAIRED            write to the repaired spare unit, or use the
                               new layout if NBA** applies

NBA**: Non-Blocking Availability. When a device/node is not available for a write request, the system switches the file to a new layout, so that the data is written to the devices of the new layout. This way the write request is not blocked waiting for the device to be fixed or for SNS repair to complete. A device/node becomes unavailable when it is OFFLINE or FAILED.

Concurrent++: concurrent read/write during SNS repair is to be designed in another module.

A device never goes directly from REPAIRED to ONLINE. When the re-balancing process that moves data from the spare space to a new device completes, the new device goes from REBALANCING to ONLINE. If the old device is ever "fixed" somehow, it re-joins as a new device in the ONLINE state.

A degraded read request is handled with the following steps:

  (1) Calculate its parity group and find the related data units and parity units. This needs help from the file's layout.
  (2) Send read requests to the necessary data and/or parity units asynchronously. The original read request blocks, waiting for those replies. For an N+K+K layout (N data units, K parity units, K spare units), N data or parity units are needed to re-compute the lost data.
  (3) As read replies arrive, ASTs are called to re-compute the data iteratively. The temporary result is stored in the buffer of the original read request, so each asynchronous read request and its reply can be released (no caching at this moment).
  (4) When all read replies have arrived and the data is finally re-computed, the original read request has its data and can be returned to the user.
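
The following sketch illustrates step (3) for the simplest case of XOR parity (K = 1): each read reply is folded into the original request's buffer as its AST runs, so the lost unit emerges after all N surviving units have been processed. For K > 1 the parity math is more involved (Reed-Solomon style); only the incremental shape of the computation is shown, and the function name is hypothetical.

  #include <stddef.h>
  #include <stdint.h>

  /* Illustrative only: fold one asynchronously-read surviving unit into
   * the accumulator held in the original read request's buffer.  With XOR
   * parity (K = 1), XOR-ing all N surviving data/parity units recovers
   * the lost unit.  This would be called from the AST handling each
   * reply. */
  static void degraded_read_fold(uint8_t *accum, const uint8_t *reply_buf,
                                 size_t unit_size)
  {
          size_t i;

          for (i = 0; i < unit_size; i++)
                  accum[i] ^= reply_buf[i];
          /* the reply and its read request can be released here */
  }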

A degraded write request is handled as follows:

  (1) Calculate its parity group and find the related data units and parity units. This needs help from the file's layout.
  (2) Prepare to asynchronously read data and/or parity units:
      (2.1) If this is a full-stripe write request, skip to step (4).
      (2.2) If the write request spans only ONLINE devices, this is similar to a read-modify-write (rmw), with one small difference: only the spanned data unit(s) are read asynchronously.
      (2.3) If the write request spans FAILED/OFFLINE devices, asynchronously read all surviving un-spanned data units and the necessary parity unit(s).
  (3) When these asynchronous reads complete, their replies come back to the client:
      (3.1) For case (2.2), compute the new parity units from the old data and the new data.
      (3.2) For case (2.3), first re-calculate the lost data unit, then proceed as in (3.1).
  (4) Send the write request(s) for the data units, together with all the new parity data, excluding the failed device(s). Note that writes to the failed devices are not sent.
  (5) When those write requests complete, return to user space.
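
To make step (4) concrete, the sketch below shows the per-unit decision of whether a write is actually sent. The device states mirror the table above, but the structure and names are hypothetical, and the redirection to spare units or a new layout is not shown.

  #include <stdbool.h>

  /* Hypothetical device states, mirroring the table above. */
  enum dev_state {
          DEV_ONLINE, DEV_OFFLINE, DEV_FAILED, DEV_REPAIRING, DEV_REPAIRED
  };

  struct unit_target {
          enum dev_state ut_dev_state; /* state of the device holding it */
          /* ... target cob, offset, data buffer, etc. ... */
  };

  /* Step (4) of a degraded write: units on FAILED/OFFLINE devices are not
   * written; their content is implied by the freshly computed parity.
   * Writes for REPAIRED devices go to the spare unit or the new layout,
   * as per the table above (redirection not shown here). */
  static bool unit_write_is_sent(const struct unit_target *t)
  {
          return t->ut_dev_state != DEV_FAILED &&
                 t->ut_dev_state != DEV_OFFLINE;
  }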

The same thread that the rmw feature uses will be used here to run the ASTs. The basic algorithm is similar in the two features. No new data structures are introduced by this feature.

The pool's failure vector is cached on clients. Every I/O request to an ioservice is tagged with the failure vector version known to the client, and the ioservice checks this version against the latest one. If the client's version is stale, the new version and the failure vector updates are returned to the client, which must apply the updates and issue the I/O request according to the latest version. Please see Storage pools and Pool machine for more details.
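
A minimal sketch of the version handshake described above; the types and names (fv_version, fvv_epoch) are hypothetical and do not match Motr's actual failure vector structures.

  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical failure vector version carried in every I/O request. */
  struct fv_version {
          uint64_t fvv_epoch; /* bumped on every pool machine event */
  };

  /* Server side: if the version tagged on the client's request is older
   * than the ioservice's current one, the reply carries the new version
   * plus the missing failure vector updates, and the client re-drives the
   * I/O against the latest failure vector. */
  static bool fv_version_is_stale(const struct fv_version *client_v,
                                  const struct fv_version *server_v)
  {
          return client_v->fvv_epoch < server_v->fvv_epoch;
  }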

Which spare space is used by SNS repair is managed by the failure vector. After SNS repair, the client can query this information from the failure vector and send read/write requests to the corresponding spare space.

State Specification

N/A

Threading and Concurrency Model

See the Detailed Level Design for read-modify-write IO requests for more information.

NUMA optimizations

See the Detailed Level Design for read-modify-write IO requests for more information.

Conformance

  • I.iosnsrepair.read Read request handling is described in the logical specification. Every node/device state is covered.
  • I.iosnsrepair.write Write request handling is described in the logical specification. Every node/device state is covered.
  • I.iosnsrepair.code The logical specification states that the same code and algorithm are used to handle I/O requests during SNS repair and in rmw.

Unit Tests

Unit tests for read and write requests with devices in different states are needed. These states include: ONLINE, OFFLINE, FAILED, REPAIRING, and REPAIRED.


System Tests

System tests are needed to verify that m0t1fs can serve reads and writes properly when a node/device is in various states, and when it changes from one state to another. For example:

  • read/write requests in normal case.
  • read/write requests when a device changes from ONLINE to FAILED.
  • read/write requests when a device changes from FAILED to REPAIRING.
  • read/write requests when a device changes from REPAIRING to REPAIRED.

Analysis

See rmw for more information.


References


Implementation Plan

The code implementation depends on the m0t1fs rmw feature, which is under development. The rmw code is in the inspection phase right now. Once this DLD is approved, coding can start.