The LNet Transport is built over an address space agnostic "core" I/O interface. This document describes the kernel implementation of this interface, which directly interacts with the Lustre LNet kernel module.
The LNet API is documented in the lustre-source distribution (version 2.0 or greater). The relationship between the various objects in the components of the LNet transport and the networking layer is illustrated in the following UML diagram.
The Core layer in the kernel has no sub-components but interfaces directly with the Lustre LNet module in the kernel.
The kernel Core module is designed to support user space transports with the use of shared memory. It does not directly provide a mechanism to communicate with the user space transport, but expects that the user space Core module will provide a device driver to communicate between user and kernel space, manage the sharing of core data structures, and interface between the kernel and user space implementations of the Core API.
The common Core data structures are designed to support such communication efficiently:
The kernel Core module will maintain an unsigned integer counter per transfer machine, to generate unique match bits for passive bulk buffers associated with that transfer machine. The upper 12 match bits are reserved by the HLD to represent the transfer machine identifier. Therefore the counter is (64-12)=52 bits wide. The value of 0 is reserved for unsolicited receive messages, so the counter range is [1,0xfffffffffffff]. It is initialized to 1 and will wrap back to 1 when it reaches its upper bound.
The transport uses the nlx_core_buf_passive_recv() or the nlx_core_buf_passive_send() subroutines to stage passive buffers. Prior to initiating these operations, the transport should use the nlx_core_buf_desc_encode() subroutine to generate new match bits for the passive buffer. The match bit counter will repeat over time, though after a very long while. It is the transport's responsibility to ensure that all of the passive buffers associated with a given transfer machine have unique match bits. The match bits should be encoded into the network buffer descriptor associated with the passive buffer.
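The counter and encoding rules above can be sketched as follows. This is an illustrative model, not the Motr source; the helper names are hypothetical, but the bit layout (12-bit TMID in the upper bits, 52-bit counter wrapping from its maximum back to 1) follows the description above.

```c
#include <stdint.h>

#define TMID_BITS    12                        /* reserved by the HLD */
#define COUNTER_BITS (64 - TMID_BITS)          /* 52-bit counter */
#define COUNTER_MAX  ((((uint64_t)1) << COUNTER_BITS) - 1)

/* Advance the per-TM counter.  The value 0 is reserved for unsolicited
 * receive messages, so the counter wraps from COUNTER_MAX back to 1. */
uint64_t mb_counter_next(uint64_t *counter)
{
    uint64_t v = *counter;
    *counter = (v == COUNTER_MAX) ? 1 : v + 1;
    return v;
}

/* Combine a transfer machine identifier and a counter value into the
 * match bits for a passive buffer. */
uint64_t mb_encode(uint64_t tmid, uint64_t counter)
{
    return (tmid << COUNTER_BITS) | (counter & COUNTER_MAX);
}
```

Because the counter eventually wraps, uniqueness among a transfer machine's concurrently staged passive buffers is the transport's responsibility, as noted above.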
The kernel Core module must ensure that all transfer machines on the host have unique transfer machine identifiers for a given NID/PID/Portal, regardless of the transport instance or network domain context in which these transfer machines are created. To support this, the nlx_kcore_tms list threads through all the kernel Core's per-TM private data structures. This list is private to the kernel Core, and is protected by the nlx_kcore_mutex.
The same list helps in assigning dynamic transfer machine identifiers: the highest available value at the upper bound of the transfer machine identifier space is assigned. The logic takes into account the NID, PID and portal number of the new transfer machine when looking for an available transfer machine identifier. A single pass over the list is required to search for an available transfer machine identifier.
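The single-pass search can be modeled as below. This is a sketch under assumptions: the function and constant names are hypothetical, the `used` array stands in for the identifiers already claimed on the nlx_kcore_tms list for the matching NID/PID/portal, and the 12-bit identifier space follows from the match-bit layout described earlier.

```c
#include <stdbool.h>
#include <string.h>

#define TMID_MAX     4095   /* 12-bit transfer machine identifier space */
#define TMID_INVALID (-1)

/* One pass over the in-use identifiers (for a given NID/PID/portal) marks
 * what is taken; the answer is the highest unmarked value. */
int tmid_assign(const int *used, int n_used)
{
    bool seen[TMID_MAX + 1];
    memset(seen, 0, sizeof seen);
    for (int i = 0; i < n_used; ++i)          /* the single pass */
        if (used[i] >= 0 && used[i] <= TMID_MAX)
            seen[used[i]] = true;
    for (int id = TMID_MAX; id >= 0; --id)    /* highest available first */
        if (!seen[id])
            return id;
    return TMID_INVALID;                      /* identifier space exhausted */
}
```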
The kernel Core receives notification of the completion of a buffer operation through an LNet callback. The completion status is not directly conveyed to the transport, because the transport layer may have processor affinity constraints that are not met by the LNet callback thread; indeed, LNet does not even state if this callback is in a schedulable context.
Instead, the kernel Core module decouples the delivery of buffer operation completion to the transport from the LNet callback context by copying the result to an intermediate buffer event queue. The Core API provides the nlx_core_buf_event_wait() subroutine that the transport can use to poll for the presence of buffer events, and the nlx_core_buf_event_get() subroutine to recover the payload of the next available buffer event. See LNet Event Callback Processing for further details on these subroutines.
There is another advantage to this indirect delivery: to address the requirement to efficiently support a user space transport, the Core module keeps this queue in memory shared between the transport and the Core, eliminating the need for a user space transport to make an ioctl call to fetch the buffer event payload. The only ioctl call needed for a user space transport is to block waiting for buffer events to appear in the shared memory queue.
It is critical for proper operation that an available buffer event structure exist when the LNet callback is invoked, or else the event cannot be delivered and will be lost. As the event queue is in shared memory, it is neither desirable nor possible to allocate a new buffer event structure in the callback context.
The Core API guarantees the delivery of buffer operation completion status by maintaining a "pool" of free buffer event structures for this purpose. It does so by keeping count of the total number of buffer event structures required to satisfy all outstanding operations, and adding additional such structures to the "pool" if necessary, when a new buffer operation is initiated. Likewise, the count is decremented for each buffer event delivered to the transport. Most buffer operations only need a single buffer event structure in which to return their operation result, but receive buffers may need more, depending on the individually configurable maximum number of messages that could be received in each receive buffer.
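The accounting described above can be sketched with plain counters. This is a simplified, hypothetical model: a real implementation grows the shared circular queue rather than an abstract counter, and the names below are not from the Motr source.

```c
/* Pool accounting: enough free buffer event structures must exist for
 * every outstanding operation at all times. */
struct bev_pool {
    unsigned needed;  /* events required by outstanding operations */
    unsigned size;    /* buffer event structures currently in the pool */
};

/* On operation start: a plain buffer needs 1 event; a receive buffer may
 * need up to its max_recv_msgs.  Returns how many structures to add. */
unsigned bev_pool_reserve(struct bev_pool *p, unsigned events_needed)
{
    p->needed += events_needed;
    unsigned grow = p->needed > p->size ? p->needed - p->size : 0;
    p->size += grow;
    return grow;
}

/* On delivery of one buffer event to the transport. */
void bev_pool_release(struct bev_pool *p)
{
    if (p->needed > 0)
        p->needed--;
}
```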
The pool and queue potentially span the kernel and user address spaces. There are two cases around the use of these data structures:
The kernel Core module combines both the free "pool" and the result queue into a single data structure: a circular, single producer, single consumer buffer event queue. Details on this event queue are covered in the LNet Buffer Event Circular Queue DLD.
The design makes a critical simplifying assumption, in that the transport will use exactly one thread to process events. This assumption implicitly serializes the delivery of the events associated with any given receive buffer, thus the last event which unlinks the buffer is guaranteed to be delivered after other events associated with that same buffer operation.
No initialization and finalization logic is required for LNet in the kernel for the following reasons:
No hardware optimization support is defined in the LNet API at this time but the nlx_core_buf_register() subroutine serves as a placeholder where any such optimizations could be made in the future. The nlx_core_buf_deregister() subroutine would be used to release any allocated resources.
During buffer registration, the kernel Core API will translate the m0_net_bufvec into the nlx_kcore_buffer::kb_kiov field of the buffer private data.
The kernel implementation of the Core API does not increment the page count of the buffer pages. The supposition here is that the buffers are allocated by Motr file system clients, and the Core API has no business imposing memory management policy beneath such a client.
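The bufvec-to-kiov translation mentioned above must split each vector segment into page-sized fragments, since a kiov entry describes at most one page. The sketch below is a hypothetical user-level model of that arithmetic only: real code stores struct page pointers, and the types and names here are not from the Motr source.

```c
#include <stddef.h>

#define PAGE_SZ 4096u

/* One page fragment: offset within the page and bytes used. */
struct frag { size_t off; size_t len; };

/* Returns the number of page fragments needed to cover a segment that
 * starts at byte address `addr` and is `len` bytes long; fills `out`
 * if non-NULL. */
size_t seg_to_frags(size_t addr, size_t len, struct frag *out)
{
    size_t n = 0;
    while (len > 0) {
        size_t off   = addr % PAGE_SZ;       /* offset into current page */
        size_t chunk = PAGE_SZ - off;        /* room left in this page */
        if (chunk > len)
            chunk = len;
        if (out != NULL)
            out[n] = (struct frag){ .off = off, .len = chunk };
        n++;
        addr += chunk;
        len  -= chunk;
    }
    return n;
}
```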
A transfer machine is associated with the following LNet resources:
The nlx_core_tm_start() subroutine creates the event handle. The nlx_core_tm_stop() subroutine releases the handle. See LNet Event Callback Processing for more details on event processing.
A network buffer is associated with a Memory Descriptor (MD). This is represented by the nlx_kcore_buffer::kb_mdh handle. There may be a Match Entry (ME) associated with this MD for some operations, but when created, it is set up to unlink automatically when the MD is unlinked so it is not explicitly tracked.
All the buffer operation initiation subroutines of the kernel Core API create such MDs. Although an MD is set up to unlink automatically upon completion, the handle value is saved in case an operation needs to be cancelled.
All MDs are associated with the EQ of the transfer machine (nlx_kcore_transfer_mc::ktm_eqh).
LNet event queues are used with an event callback subroutine to avoid event loss. The callback subroutine overhead is fairly minimal, as it only copies out the event payload and arranges for subsequent asynchronous delivery. This, coupled with the fact that the circular buffer works optimally with a single producer and single consumer, resulted in the decision to use just one LNet EQ per transfer machine (nlx_kcore_transfer_mc::ktm_eqh).
The EQ is created in the call to the nlx_core_tm_start() subroutine, and is freed in the call to the nlx_core_tm_stop() subroutine.
LNet requires that the callback subroutine be re-entrant and non-blocking, and not make any LNet API calls. Given that the circular queue assumes a single producer and single consumer, a spin lock is used to serialize access to the circular queue from concurrent callback invocations.
The event callback requires that the MD user_ptr field be set to the address of the nlx_kcore_buffer data structure. Note that if an event has the unlinked field set then this will be the last event that LNet will post for the related operation, and the user_ptr field will be valid, so the callback can safely de-reference the field to determine the correct queue.
The callback subroutine does the following:
- It ignores LNET_EVENT_SEND events delivered as a result of an LNetGet() call if the unlinked field of the event is not set. If the unlinked field is set, the event could either be an out-of-order SEND (terminating a REPLY/SEND sequence), or the piggy-backed UNLINK on an in-order SEND. The two cases are distinguished by explicitly tracking the receipt of an out-of-order REPLY (in nlx_kcore_buffer::kb_ooo_reply). An out-of-order SEND is treated as though it were the terminating LNET_EVENT_REPLY event of a SEND/REPLY sequence.
- It does not deliver to the circular queue LNET_EVENT_REPLY events that do not have their unlinked field set. They indicate an out-of-sequence REPLY/SEND combination, and LNet will issue a valid SEND event subsequently. However, the receipt of such a REPLY is remembered in nlx_kcore_buffer::kb_ooo_reply, and its payload in the other "ooo" fields, so that when the out-of-order SEND arrives, this data can be used to generate the circular queue event.
- It ignores LNET_EVENT_ACK events.
- It copies the LNet event payload into the next available buffer event structure, including the value of the unlinked field of the event, which must be copied to the nlx_core_buffer_event::cbe_unlinked field. For LNET_EVENT_UNLINK events, a -ECANCELED value is written to the nlx_core_buffer_event::cbe_status field and the nlx_core_buffer_event::cbe_unlinked field is set to true. For LNET_EVENT_PUT events corresponding to unsolicited message delivery, the sender's TMID and Portal are encoded in the hdr_data; these values are decoded into the nlx_core_buffer_event::cbe_sender, along with the initiator's NID and PID. The nlx_core_buffer_event::cbe_sender is not set for other events.

The (single) transport layer event handler thread blocks on the Core transfer machine semaphore in the Core API nlx_core_buf_event_wait() subroutine, which internally uses the m0_semaphore_timeddown() subroutine to wait on the semaphore. When the Core API subroutine returns with an indication of the presence of events, the event handler thread consumes all the pending events with multiple calls to the Core API nlx_core_buf_event_get() subroutine, which internally uses the bev_cqueue_get() subroutine to get the next buffer event. Then the event handler thread repeats the call to the nlx_core_buf_event_wait() subroutine to once again block for additional events.
In the case of the user space transport, the blocking on the semaphore is done indirectly by the user space Core API's device driver in the kernel. The HLD requires that as many events as possible be consumed before the next context switch to the kernel must be made. To support this, the kernel Core nlx_core_buf_event_wait() subroutine takes a few additional steps to minimize the chance of returning when the queue is empty. After it obtains the semaphore with the m0_semaphore_timeddown() subroutine (i.e. the P operation succeeds), it attempts to clear the semaphore count by repeatedly calling the m0_semaphore_trydown() subroutine until it fails. It then checks the circular queue, and only if it is not empty will it return.
To stage a buffer for unsolicited message reception:
- Create an ME with LNetMEAttach() for the transfer machine, specifying the portal, match and ignore bits. All receive buffers for a given TM use a match bit value equal to the TM identifier in the higher order bits and zeros for the other bits. No ignore bits are set. The ME should be set up to unlink automatically, as it will be used for all receive buffers of this transfer machine, and should be positioned at the end of the portal match list. There is no need to retain the ME handle beyond the subsequent LNetMDAttach() call.
- Create and attach an MD to the ME using LNetMDAttach(). The MD is set up to unlink automatically. Save the MD handle in the nlx_kcore_buffer::kb_mdh field. Set up the fields of the lnet_md_t argument as follows:
  - Set the eq_handle to identify the EQ associated with the transfer machine (nlx_kcore_transfer_mc::ktm_eqh).
  - Set the address of the nlx_kcore_buffer in the user_ptr field.
  - Set the threshold value to the nlx_kcore_buffer::kb_max_recv_msgs value.
  - Set the max_size value to the nlx_kcore_buffer::kb_min_recv_size value.
  - Set the LNET_MD_OP_PUT, LNET_MD_MAX_SIZE and LNET_MD_KIOV flags in the options field.
- When a message arrives, an LNET_EVENT_PUT event will be delivered to the event queue, and will be processed as described in LNet Event Callback Processing.

To send a message:
- Create an MD using LNetMDBind() with each invocation of the nlx_core_buf_msg_send() subroutine. The MD is set up to unlink automatically. Save the MD handle in the nlx_kcore_buffer::kb_mdh field. Set up the fields of the lnet_md_t argument as follows:
  - Set the eq_handle to identify the EQ associated with the transfer machine (nlx_kcore_transfer_mc::ktm_eqh).
  - Set the address of the nlx_kcore_buffer in the user_ptr field.
  - Set the LNET_MD_KIOV flag in the options field.
- Use the LNetPut() subroutine to send the MD to the destination. The match bits must be set to the destination TM identifier in the higher order bits and zeros for the other bits. The hdr_data must be set to a value encoding the TMID (in the upper bits, like the match bits) and the portal (in the lower bits). No acknowledgment should be requested.
- When the send completes, an LNET_EVENT_SEND event will be delivered to the event queue, and processed as described in LNet Event Callback Processing.

To stage a passive bulk buffer:
- Create an ME using LNetMEAttach(). Specify the portal and match_id fields as appropriate for the transfer machine. The buffer's match bits are obtained from the nlx_core_buffer::cb_match_bits field. No ignore bits are set. The ME should be set up to unlink automatically, so there is no need to save the handle for later use, and should be positioned at the end of the portal match list.
- Create and attach an MD to the ME using LNetMDAttach() with each invocation of the nlx_core_buf_passive_recv() or the nlx_core_buf_passive_send() subroutines. The MD is set up to unlink automatically. Save the MD handle in the nlx_kcore_buffer::kb_mdh field. Set up the fields of the lnet_md_t argument as follows:
  - Set the eq_handle to identify the EQ associated with the transfer machine (nlx_kcore_transfer_mc::ktm_eqh).
  - Set the address of the nlx_kcore_buffer in the user_ptr field.
  - Set the LNET_MD_KIOV flag in the options field, along with either the LNET_MD_OP_PUT or the LNET_MD_OP_GET flag according to the direction of data transfer.
- When the bulk data transfer completes, an LNET_EVENT_PUT or an LNET_EVENT_GET event will be delivered to the event queue, and will be processed as described in LNet Event Callback Processing.

To initiate an active bulk read or write:
- Create an MD using LNetMDBind() with each invocation of the nlx_core_buf_active_recv() or nlx_core_buf_active_send() subroutines. The MD is set up to unlink automatically. Save the MD handle in the nlx_kcore_buffer::kb_mdh field. Set up the fields of the lnet_md_t argument as follows:
  - Set the eq_handle to identify the EQ associated with the transfer machine (nlx_kcore_transfer_mc::ktm_eqh).
  - Set the address of the nlx_kcore_buffer in the user_ptr field.
  - Set the LNET_MD_KIOV flag in the options field.
  - In the case of LNetGet(), set the threshold value to 2 to accommodate both the SEND and the REPLY events. Otherwise set it to 1.
- Use the LNetGet() subroutine to initiate the active read, or the LNetPut() subroutine to initiate the active write. The hdr_data is set to 0 in the case of LNetPut(). No acknowledgment should be requested. In the case of an LNetGet(), the field used to track out-of-order REPLY events (nlx_kcore_buffer::kb_ooo_reply) should be cleared before the operation is initiated.
- When the LNetGet() or LNetPut() call completes, an LNET_EVENT_SEND event will be delivered to the event queue; it should typically be ignored in the case of LNetGet(). See LNet Event Callback Processing for details.
- When the bulk data transfer for LNetGet() completes, an LNET_EVENT_REPLY event will be delivered to the event queue, and will be processed as described in LNet Event Callback Processing.
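The hdr_data encoding used when sending messages (TMID in the upper bits, portal in the lower bits) can be sketched as below. The field widths are assumptions for illustration (mirroring the 12-bit TMID of the match-bit layout), and the function names are hypothetical, not taken from the Motr source.

```c
#include <stdint.h>

#define HDR_TMID_SHIFT 52   /* assumed: TMID occupies the upper 12 bits */

/* Pack the sender's TMID and portal number into the 64-bit hdr_data. */
uint64_t hdr_data_encode(uint32_t tmid, uint32_t portal)
{
    return ((uint64_t)tmid << HDR_TMID_SHIFT) | portal;
}

/* Recover the sender's TMID and portal from a received hdr_data, as the
 * callback does for unsolicited LNET_EVENT_PUT events. */
void hdr_data_decode(uint64_t hdr, uint32_t *tmid, uint32_t *portal)
{
    *tmid   = (uint32_t)(hdr >> HDR_TMID_SHIFT);
    *portal = (uint32_t)(hdr & ((((uint64_t)1) << HDR_TMID_SHIFT) - 1));
}
```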
Note that the SEND and REPLY events of an LNetGet() operation may arrive in any order. Also note that in the case of an LNetGet() operation, the SEND event does not indicate if the recipient was able to save the data, but merely that the request left the host.

The kernel Core module provides no timeout capability. The transport may initiate a cancel operation using the nlx_core_buf_del() subroutine.
This will result in an LNetMDUnlink() subroutine call being issued for the buffer MD saved in the nlx_kcore_buffer::kb_mdh field. Cancellation may or may not take place: it depends upon whether the operation has started, and there is a race condition between making this call and the concurrent delivery of an event associated with the MD.
Assuming success, the next event delivered for the buffer concerned will either be an LNET_EVENT_UNLINK event, or the unlinked field will be set in the next completion event for the buffer. The events will be processed as described in LNet Event Callback Processing.
LNet properly handles the race condition between the automatic unlink of the MD and a call to LNetMDUnlink().
The state of out-of-order events is tracked, for LNetGet() calls, in the nlx_kcore_buffer data structure associated with the call. This is because there are two events (SEND and REPLY) that are returned for this operation, LNet does not guarantee their order of arrival, and the event processing logic is set up such that a circular buffer event must be created only upon receipt of the last operation event. Complicating the issue is that a cancellation response could be piggy-backed onto an in-order SEND. See LNet Event Callback Processing and LNet Active Bulk Read or Write for details.

Buffer operations are cancelled with LNetMDUnlink().

The LNet transport will initiate calls to the API on threads that may have specific processor affinity assigned.
LNet offers no direct NUMA optimizations. In particular, event callbacks cannot be constrained to have any specific processor affinity. The API compensates for this lack of support by providing a level of indirection in event delivery: its callback handler simply copies the LNet event payload to an event delivery queue and notifies a transport event processing thread of the presence of the event. (See The Buffer Event Queue above). The transport event processing threads can be constrained to have any desired processor affinity.
The testing strategy is two-pronged:
- Unit testing, conducted over a "tcp" network.
- System testing, performed as part of the transport operation system test.