Motr: LNet Transport Kernel Core DLD

Overview

The LNet Transport is built over an address space agnostic "core" I/O interface. This document describes the kernel implementation of this interface, which directly interacts with the Lustre LNet kernel module.


Definitions

  • HLD of Motr LNet Transport: for documentation links, refer to doc/motr-design-doc-list.rst.

Requirements

  • r.m0.net.lnet.buffer-registration Provide support for hardware optimization through buffer pre-registration.
  • r.m0.net.xprt.lnet.end-point-address The implementation should support the mapping of end point address to LNet address as described in the Refinement section of the HLD.
  • r.m0.net.xprt.lnet.multiple-messages-in-buffer Provide support for this feature as described in the HLD.
  • r.m0.net.xprt.lnet.dynamic-address-assignment Provide support for dynamic address assignment as described in the HLD.
  • r.m0.net.xprt.lnet.user-space The implementation must accommodate the needs of the user space LNet transport.
  • r.m0.net.xprt.lnet.user.no-gpl The implementation must not expose the user space transport to GPL interfaces.

Dependencies

  • LNet API headers are required to build the module. The Lustre source package must be installed on the build machine (RPM lustre-source version 2.0 or greater).
  • Lustre run time
  • r.m0.lib.atomic.interoperable-kernel-user-support The Buffer Event Circular Queue provides a shared data structure for efficiently passing event notifications from the Core layer to the LNet transport layer.
  • r.net.xprt.lnet.growable-event-queue The Buffer Event Circular Queue provides a way to expand the event queue as new buffers are queued with a transfer machine, ensuring no events are lost.

Design Highlights

  • The Core API is an address space agnostic I/O interface intended for use by the Motr Networking LNet transport operation layer in either user space or kernel space.
  • Efficient support for the user space transports is provided by use of cross-address space tolerant data structures in shared memory.
  • The Core API does not expose any LNet symbols.
  • Each transfer machine is internally assigned one LNet event queue for all its LNet buffer operations.
  • Buffer event space is pre-allocated to guarantee that buffer operation results can be returned.
  • The notification of the completion of a buffer operation to the transport layer is decoupled from the LNet callback that provided this notification to the core module.
  • The number of messages that can be delivered into a single receive buffer is bounded to support pre-allocation of memory to hold the buffer event payload.
  • Buffer completion event notification is provided via a semaphore. The design guarantees delivery of events in the order received from LNet. In particular, the multiple possible events delivered for a single receive buffer will be ordered.

Logical Specification

Component Overview

The relationship between the various objects in the components of the LNet transport and the networking layer is illustrated in the following UML diagram.

[Figure lnet_xo.png: LNet Transport Objects]

The Core layer in the kernel has no sub-components but interfaces directly with the Lustre LNet module in the kernel.

Support for User Space Transports

The kernel Core module is designed to support user space transports with the use of shared memory. It does not directly provide a mechanism to communicate with the user space transport, but expects that the user space Core module will provide a device driver to communicate between user and kernel space, manage the sharing of core data structures, and interface between the kernel and user space implementations of the Core API.

The common Core data structures are designed to support such communication efficiently:

  • The core data structures are organized with a distinction between the common directly shareable portions, and private areas for kernel and user space data. This allows each address space to place pointer values of its address space in private regions associated with the shared data structures.
  • An address space opaque pointer type is provided to safely save pointer values in shared memory locations where necessary (a minimal sketch follows this list).
  • The single producer, single consumer circular buffer event queue shared between the transport and the core layer in the kernel is designed to work with the producer and consumer potentially in different address spaces. This is described in further detail in The Buffer Event Queue.
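
The following minimal sketch illustrates the idea behind the opaque pointer type; the type and function names here are illustrative, not the actual Motr identifiers.

#include <stdint.h>

/* An address space opaque pointer: the raw value lives in shared memory
 * but is only meaningful in the address space that stored it. */
typedef struct {
        uint64_t aso_value;
} aso_ptr_t;

static inline void aso_set(aso_ptr_t *p, void *ptr)
{
        p->aso_value = (uint64_t)(uintptr_t)ptr;
}

static inline void *aso_get(const aso_ptr_t *p)
{
        /* Valid only in the address space that called aso_set(). */
        return (void *)(uintptr_t)p->aso_value;
}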

Match Bits for Buffer Identification

The kernel Core module maintains an unsigned integer counter per transfer machine to generate unique match bits for passive bulk buffers associated with that transfer machine. The upper 12 match bits are reserved by the HLD to represent the transfer machine identifier, so the counter is 64-12 = 52 bits wide. The value 0 is reserved for unsolicited receive messages, so the counter range is [1, 0xfffffffffffff]. It is initialized to 1 and wraps back to 1 when it reaches its upper bound.
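
The counter arithmetic can be sketched as follows; the names used here (TMID_BITS, match_bits_next(), and so on) are illustrative only.

#include <stdint.h>

enum {
        TMID_BITS    = 12,              /* HLD: upper 12 bits hold the TMID */
        COUNTER_BITS = 64 - TMID_BITS,  /* 52-bit counter */
};

#define COUNTER_MAX ((UINT64_C(1) << COUNTER_BITS) - 1)  /* 0xfffffffffffff */

/* Combine the transfer machine identifier with the next counter value.
 * The value 0 is reserved for unsolicited receive messages, so the
 * counter wraps from COUNTER_MAX back to 1. */
static uint64_t match_bits_next(uint32_t tmid, uint64_t *counter)
{
        uint64_t mb = ((uint64_t)tmid << COUNTER_BITS) | *counter;

        if (++*counter > COUNTER_MAX)
                *counter = 1;
        return mb;
}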

The transport uses the nlx_core_buf_passive_recv() or the nlx_core_buf_passive_send() subroutines to stage passive buffers. Prior to initiating these operations, the transport should use the nlx_core_buf_desc_encode() subroutine to generate new match bits for the passive buffer. The match bit counter will eventually wrap, though only after a very long time; it is the transport's responsibility to ensure that all of the passive buffers associated with a given transfer machine have unique match bits. The match bits should be encoded into the network buffer descriptor associated with the passive buffer.

Transfer Machine Uniqueness

The kernel Core module must ensure that all transfer machines on the host have unique transfer machine identifiers for a given NID/PID/Portal, regardless of the transport instance or network domain context in which these transfer machines are created. To support this, the nlx_kcore_tms list threads through all the kernel Core's per-TM private data structures. This list is private to the kernel Core, and is protected by the nlx_kcore_mutex.

The same list helps in assigning dynamic transfer machine identifiers. Dynamic assignment starts at the upper bound of the transfer machine identifier space and selects the highest available value, taking the NID, PID and portal number of the new transfer machine into account. A single pass over the list suffices to find an available transfer machine identifier.
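
One way such a single-pass search could work is sketched below, under the assumption (made here for illustration, not stated by the design) that matching entries are visited in order of descending identifier; all names are illustrative.

#include <stddef.h>
#include <stdint.h>

enum { TMID_MAX = (1 << 12) - 1 };      /* 12-bit TM identifier space */

struct tm_entry {                       /* illustrative per-TM private data */
        struct tm_entry *te_next;
        uint64_t         te_nid;
        uint32_t         te_pid;
        uint32_t         te_portal;
        uint32_t         te_tmid;
};

/* Return the highest TM identifier unused for the given NID/PID/portal,
 * or a negative value if the identifier space is exhausted. */
static int tmid_assign(const struct tm_entry *list, uint64_t nid,
                       uint32_t pid, uint32_t portal)
{
        int candidate = TMID_MAX;
        const struct tm_entry *p;

        for (p = list; p != NULL; p = p->te_next) {
                if (p->te_nid != nid || p->te_pid != pid ||
                    p->te_portal != portal)
                        continue;       /* different endpoint: no conflict */
                if ((int)p->te_tmid == candidate)
                        candidate--;    /* in use; try the next lower id */
                else if ((int)p->te_tmid < candidate)
                        break;          /* gap found above this entry */
        }
        return candidate;
}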

The Buffer Event Queue

The kernel Core receives notification of the completion of a buffer operation through an LNet callback. The completion status is not directly conveyed to the transport, because the transport layer may have processor affinity constraints that are not met by the LNet callback thread; indeed, LNet does not even state whether this callback is invoked in a schedulable context.

Instead, the kernel Core module decouples the delivery of buffer operation completion to the transport from the LNet callback context by copying the result to an intermediate buffer event queue. The Core API provides the nlx_core_buf_event_wait() subroutine that the transport can use to poll for the presence of buffer events, and the nlx_core_buf_event_get() subroutine to recover the payload of the next available buffer event. See LNet Event Callback Processing for further details on these subroutines.

There is another advantage to this indirect delivery: to address the requirement to efficiently support a user space transport, the Core module keeps this queue in memory shared between the transport and the Core, eliminating the need for a user space transport to make an ioctl call to fetch the buffer event payload. The only ioctl call needed for a user space transport is to block waiting for buffer events to appear in the shared memory queue.

It is critical for proper operation that a buffer event structure be available when the LNet callback is invoked; otherwise the event cannot be delivered and will be lost. As the event queue is in shared memory, it is not possible, let alone desirable, to allocate a new buffer event structure in the callback context.

The Core API guarantees the delivery of buffer operation completion status by maintaining a "pool" of free buffer event structures for this purpose. It does so by keeping count of the total number of buffer event structures required to satisfy all outstanding operations, and adding additional such structures to the "pool" if necessary when a new buffer operation is initiated. Likewise, the count is decremented for each buffer event delivered to the transport. Most buffer operations need only a single buffer event structure in which to return their operation result, but receive buffers may need more, depending on the individually configurable maximum number of messages that could be received in each receive buffer.
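
The accounting can be sketched as follows; the structure and helper names are hypothetical, and error handling is elided.

#include <stdbool.h>
#include <stdint.h>

void bev_cqueue_grow(void);     /* hypothetical queue expansion helper */

struct tm_accounting {          /* hypothetical per-TM accounting state */
        uint32_t bev_needed;    /* events needed by outstanding operations */
        uint32_t bev_allocated; /* event structures currently in the queue */
};

/* Called when a buffer operation is initiated. */
static void bev_pool_reserve(struct tm_accounting *a, bool is_receive,
                             uint32_t max_recv_msgs)
{
        a->bev_needed += is_receive ? max_recv_msgs : 1;
        while (a->bev_allocated < a->bev_needed) {
                bev_cqueue_grow();
                a->bev_allocated++;
        }
}

/* Called when a buffer event is delivered to the transport. */
static void bev_pool_release(struct tm_accounting *a)
{
        a->bev_needed--;
}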

The pool and queue potentially span the kernel and user address spaces. There are two cases around the use of these data structures:

  • Normal queue operation involves a single producer, in the kernel Core callback subroutine, and a single consumer, in the Core API nlx_core_buf_event_get() subroutine, which may be invoked either in the kernel or in user space.
  • The allocation of new buffer event structures to the "pool" is always done by the Core API buffer operation initiation subroutines invoked by the transport. The user space implementation of the Core API would have to arrange for these new structures to get mapped into the kernel at this time.

The kernel Core module combines both the free "pool" and the result queue into a single data structure: a circular, single producer, single consumer buffer event queue. Details on this event queue are covered in the LNet Buffer Event Circular Queue DLD.

The design makes a critical simplifying assumption, in that the transport will use exactly one thread to process events. This assumption implicitly serializes the delivery of the events associated with any given receive buffer, thus the last event which unlinks the buffer is guaranteed to be delivered after other events associated with that same buffer operation.

LNet Initialization and Finalization

No initialization and finalization logic is required for LNet in the kernel for the following reasons:

  • Use of the LNet kernel module is reference counted by the kernel.
  • The LNetInit() subroutine is automatically called when the LNet kernel module is loaded, and cannot be called multiple times.

LNet Buffer Registration

No hardware optimization support is defined in the LNet API at this time but the nlx_core_buf_register() subroutine serves as a placeholder where any such optimizations could be made in the future. The nlx_core_buf_deregister() subroutine would be used to release any allocated resources.

During buffer registration, the kernel Core API will translate the m0_net_bufvec into the nlx_kcore_buffer::kb_kiov field of the buffer private data.

The kernel implementation of the Core API does not increment the page count of the buffer pages. The supposition here is that the buffers are allocated by Motr file system clients, and the Core API has no business imposing memory management policy beneath such a client.
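
As an illustration, the translation pairs buffer pages with KIOV entries. lnet_kiov_t is the Lustre 2.x KIOV element type; the function and parameter names here are illustrative.

/* Fill an LNet KIOV from a page-based buffer vector.  'nob' is the total
 * number of bytes; the first page may start at a non-zero offset. */
static void pages_to_kiov(lnet_kiov_t *kiov, struct page **pages,
                          unsigned nr_pages, unsigned first_offset,
                          size_t nob)
{
        unsigned i;

        for (i = 0; i < nr_pages; ++i) {
                unsigned off = i == 0 ? first_offset : 0;
                size_t   max = PAGE_SIZE - off;

                kiov[i].kiov_page   = pages[i];
                kiov[i].kiov_offset = off;
                kiov[i].kiov_len    = nob < max ? nob : max;
                nob                -= kiov[i].kiov_len;
        }
}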

LNet Transfer Machine Resources

A transfer machine is associated with the following LNet resources:

  • An Event Queue (EQ), whose handle is saved in the nlx_kcore_transfer_mc::ktm_eqh field.

The nlx_core_tm_start() subroutine creates the event queue handle. The nlx_core_tm_stop() subroutine releases the handle. See LNet Event Callback Processing for more details on event processing.

LNet Buffer Resources

A network buffer is associated with a Memory Descriptor (MD). This is represented by the nlx_kcore_buffer::kb_mdh handle. There may be a Match Entry (ME) associated with this MD for some operations, but when created, it is set up to unlink automatically when the MD is unlinked so it is not explicitly tracked.

All the buffer operation initiation subroutines of the kernel Core API create such MDs. Although an MD is set up to automatically unlink upon completion, its handle is saved in case an operation needs to be cancelled.

All MDs are associated with the EQ of the transfer machine (nlx_kcore_transfer_mc::ktm_eqh).

LNet Event Callback Processing

LNet event queues are used with an event callback subroutine to avoid event loss. The callback subroutine overhead is fairly minimal, as it only copies out the event payload and arranges for subsequent asynchronous delivery. This, coupled with the fact that the circular buffer used works optimally with a single producer and a single consumer, led to the decision to use just one LNet EQ per transfer machine (nlx_kcore_transfer_mc::ktm_eqh).

The EQ is created in the call to the nlx_core_tm_start() subroutine, and is freed in the call to the nlx_core_tm_stop() subroutine.

LNet requires that the callback subroutine be re-entrant and non-blocking, and that it make no LNet API calls. Given that the circular queue assumes a single producer and a single consumer, a spin lock is used to serialize access to the circular queue across potentially concurrent callback invocations.

The event callback requires that the MD user_ptr field be set to the address of the nlx_kcore_buffer data structure. Note that if an event has the unlinked field set then this will be the last event that LNet will post for the related operation, and the user_ptr field will be valid, so the callback can safely de-reference the field to determine the correct queue.

The callback subroutine does the following (a condensed sketch in C follows this list):

  1. It will ignore LNET_EVENT_SEND events delivered as a result of an LNetGet() call if the unlinked field of the event is not set. If the unlinked field is set, the event could either be an out-of-order SEND (terminating a REPLY/SEND sequence), or the piggy-backed UNLINK on an in-order SEND. The two cases are distinguished by explicitly tracking the receipt of an out-of-order REPLY (in nlx_kcore_buffer::kb_ooo_reply). An out-of-order SEND will be treated as though it is the terminating LNET_EVENT_REPLY event of a SEND/REPLY sequence.
  2. It will not create an event in the circular queue for LNET_EVENT_REPLY events that do not have their unlinked field set. They indicate an out-of-sequence REPLY/SEND combination, and LNet will issue a valid SEND event subsequently. However, the receipt of such a REPLY will be remembered in nlx_kcore_buffer::kb_ooo_reply, and its payload in the other "ooo" fields, so that when the out-of-order SEND arrives, this data can be used to generate the circular queue event.
  3. It will ignore LNET_EVENT_ACK events.
  4. It obtains the nlx_kcore_transfer_mc::ktm_bevq_lock spin lock.
  5. The bev_cqueue_pnext() subroutine is then used to locate the next buffer event structure in the circular buffer event queue which will be used to return the result.
  6. It copies the event payload from the LNet event to the buffer event structure. This includes the value of the unlinked field of the event, which must be copied to the nlx_core_buffer_event::cbe_unlinked field. For LNET_EVENT_UNLINK events, a -ECANCELED value is written to the nlx_core_buffer_event::cbe_status field and the nlx_core_buffer_event::cbe_unlinked field is set to true. For LNET_EVENT_PUT events corresponding to unsolicited message delivery, the sender's TMID and portal are encoded in the hdr_data; these values are decoded into the nlx_core_buffer_event::cbe_sender field, along with the initiator's NID and PID. The nlx_core_buffer_event::cbe_sender field is not set for other events.
  7. It invokes the bev_cqueue_put() subroutine to "produce" the event in the circular queue.
  8. It releases the nlx_kcore_transfer_mc::ktm_bevq_lock spin lock.
  9. It signals the nlx_kcore_transfer_mc::ktm_sem semaphore with the m0_semaphore_up() subroutine.

The (single) transport layer event handler thread blocks on the Core transfer machine semaphore in the Core API nlx_core_buf_event_wait() subroutine which uses the m0_semaphore_timeddown() subroutine internally to wait on the semaphore. When the Core API subroutine returns with an indication of the presence of events, the event handler thread consumes all the pending events with multiple calls to the Core API nlx_core_buf_event_get() subroutine, which uses the bev_cqueue_get() subroutine internally to get the next buffer event. Then the event handler thread repeats the call to the nlx_core_buf_event_wait() subroutine to once again block for additional events.
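
A transport event handler thread following this pattern might look like the sketch below; the subroutine signatures shown are illustrative, and deliver_to_transport() is a hypothetical helper.

for (;;) {
        /* Block until the Core indicates the presence of buffer events. */
        if (nlx_core_buf_event_wait(ctm, timeout) != 0)
                continue;               /* timed out; no events pending */

        /* Consume every pending event before blocking again. */
        while (nlx_core_buf_event_get(ctm, &bev))
                deliver_to_transport(&bev);
}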

In the case of the user space transport, the blocking on the semaphore is done indirectly by the user space Core API's device driver in the kernel. It is required by the HLD that as many events as possible be consumed before the next context switch to the kernel must be made. To support this, the kernel Core nlx_core_buf_event_wait() subroutine takes a few additional steps to minimize the chance of returning when the queue is empty. After it obtains the semaphore with the m0_semaphore_timeddown() subroutine (i.e. the P operation succeeds), it attempts to clear the semaphore count by repeatedly calling the m0_semaphore_trydown() subroutine until it fails. It then checks the circular queue, and only if not empty will it return. This is illustrated with the following pseudo-code:

do {
        rc = m0_semaphore_timeddown(&sem, &timeout);
        if (rc < 0)
                break;                          // timed out
        while (m0_semaphore_trydown(&sem))
                ;                               // exhaust the semaphore
} while (bev_cqueue_is_empty(&q));              // loop if empty

(C++ style comments are used above only because of doxygen; they are not permitted by the Motr style guide.)

LNet Receiving Unsolicited Messages

  1. Create an ME with LNetMEAttach() for the transfer machine and specify the portal, match and ignore bits. All receive buffers for a given TM will use a match bit value equal to the TM identifier in the higher order bits and zeros for the other bits. No ignore bits are set. The ME should be set up to unlink automatically as it will be used for all receive buffers of this transfer machine. The ME entry should be positioned at the end of the portal match list. There is no need to retain the ME handle beyond the subsequent LNetMDAttach() call.
  2. Create and attach an MD to the ME using LNetMDAttach() (a sketch follows this list). The MD is set up to unlink automatically. Save the MD handle in the nlx_kcore_buffer::kb_mdh field. Set up the fields of the lnet_md_t argument as follows:
    • Set the eq_handle to identify the EQ associated with the transfer machine (nlx_kcore_transfer_mc::ktm_eqh).
    • Set the address of the nlx_kcore_buffer in the user_ptr field.
    • Pass in the KIOV from the nlx_kcore_buffer::kb_kiov.
    • Set the threshold value to the nlx_kcore_buffer::kb_max_recv_msgs value.
    • Set the max_size value to the nlx_kcore_buffer::kb_min_recv_size value.
    • Set the LNET_MD_OP_PUT, LNET_MD_MAX_SIZE and LNET_MD_KIOV flags in the options field.
  3. When a message arrives, an LNET_EVENT_PUT event will be delivered to the event queue, and will be processed as described in LNet Event Callback Processing.
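
The two setup steps might look like the following sketch, using the Lustre 2.x LNet API. The kb and ktm field names follow this document, except kb_kiov_len, which is an assumed name for the KIOV entry count; tm_portal, tm_id and tmid_match_bits are illustrative, and error handling is elided.

lnet_handle_me_t meh;
lnet_md_t        umd;

rc = LNetMEAttach(tm_portal, tm_id,     /* the TM's portal and process id */
                  tmid_match_bits,      /* TMID in upper bits, rest zero */
                  0,                    /* no ignore bits */
                  LNET_UNLINK, LNET_INS_AFTER, &meh);

umd.start     = kb->kb_kiov;
umd.length    = kb->kb_kiov_len;        /* number of KIOV entries */
umd.threshold = kb->kb_max_recv_msgs;
umd.max_size  = kb->kb_min_recv_size;
umd.options   = LNET_MD_OP_PUT | LNET_MD_MAX_SIZE | LNET_MD_KIOV;
umd.user_ptr  = kb;
umd.eq_handle = ktm->ktm_eqh;

rc = LNetMDAttach(meh, umd, LNET_UNLINK, &kb->kb_mdh);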

LNet Sending Messages

  1. Create an MD using LNetMDBind() with each invocation of the nlx_core_buf_msg_send() subroutine (a sketch follows this list). The MD is set up to unlink automatically. Save the MD handle in the nlx_kcore_buffer::kb_mdh field. Set up the fields of the lnet_md_t argument as follows:
    • Set the eq_handle to identify the EQ associated with the transfer machine (nlx_kcore_transfer_mc::ktm_eqh).
    • Set the address of the nlx_kcore_buffer in the user_ptr field.
    • Pass in the KIOV from the nlx_kcore_buffer::kb_kiov. The number of entries in the KIOV and the length field in the last element of the vector must be adjusted to reflect the desired byte count.
    • Set the LNET_MD_KIOV flag in the options field.
  2. Use the LNetPut() subroutine to send the MD to the destination. The match bits must be set to the destination TM identifier in the higher order bits and zeros for the other bits. The hdr_data must be set to a value encoding the TMID (in the upper bits, like the match bits) and the portal (in the lower bits). No acknowledgment should be requested.
  3. When the message is sent, an LNET_EVENT_SEND event will be delivered to the event queue, and processed as described in LNet Event Callback Processing.
    Note
    The event does not indicate if the recipient was able to save the data, but merely that it left the host.
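
These steps might be sketched as follows, with the same naming caveats as the receive sketch; self_nid, target, dest_portal, dest_tmid_match_bits, kiov_entries and hdr_data are illustrative variables.

lnet_md_t umd;

umd.start     = kb->kb_kiov;            /* KIOV adjusted to the byte count */
umd.length    = kiov_entries;
umd.threshold = 1;
umd.options   = LNET_MD_KIOV;
umd.user_ptr  = kb;
umd.eq_handle = ktm->ktm_eqh;

rc = LNetMDBind(umd, LNET_UNLINK, &kb->kb_mdh);

/* hdr_data encodes the sender's TMID (upper bits) and portal (lower bits). */
rc = LNetPut(self_nid, kb->kb_mdh, LNET_NOACK_REQ,
             target,                    /* lnet_process_id_t of the peer */
             dest_portal,
             dest_tmid_match_bits,      /* destination TMID in upper bits */
             0,                         /* offset */
             hdr_data);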

LNet Staging Passive Bulk Buffers

  1. Prior to invoking the nlx_core_buf_passive_recv() or the nlx_core_buf_passive_send() subroutines, the transport should use the nlx_core_buf_desc_encode() subroutine to assign unique match bits to the passive buffer. See Match Bits for Buffer Identification for details. The match bits should be encoded into the network buffer descriptor and independently conveyed to the remote active transport. The network descriptor also encodes the number of bytes to be transferred.
  2. Create an ME using LNetMEAttach(). Specify the portal and match_id fields as appropriate for the transfer machine. The buffer's match bits are obtained from the nlx_core_buffer::cb_match_bits field. No ignore bits are set. The ME should be set up to unlink automatically, so there is no need to save the handle for later use. The ME should be positioned at the end of the portal match list.
  3. Create and attach an MD to the ME using LNetMDAttach() with each invocation of the nlx_core_buf_passive_recv() or the nlx_core_buf_passive_send() subroutines (a sketch follows this list). The MD is set up to unlink automatically. Save the MD handle in the nlx_kcore_buffer::kb_mdh field. Set up the fields of the lnet_md_t argument as follows:
    • Set the eq_handle to identify the EQ associated with the transfer machine (nlx_kcore_transfer_mc::ktm_eqh).
    • Set the address of the nlx_kcore_buffer in the user_ptr field.
    • Pass in the KIOV from the nlx_kcore_buffer::kb_kiov.
    • Set the LNET_MD_KIOV flag in the options field, along with either the LNET_MD_OP_PUT or the LNET_MD_OP_GET flag according to the direction of data transfer.
  4. When the bulk data transfer completes, either an LNET_EVENT_PUT or an LNET_EVENT_GET event will be delivered to the event queue, and will be processed as described in LNet Event Callback Processing.
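
The ME/MD setup mirrors the unsolicited receive sketch; a compressed version follows, with the same naming caveats (is_recv selects the direction of data transfer).

rc = LNetMEAttach(tm_portal, tm_id,
                  cb->cb_match_bits,    /* nlx_core_buffer::cb_match_bits */
                  0, LNET_UNLINK, LNET_INS_AFTER, &meh);

umd.start     = kb->kb_kiov;
umd.length    = kiov_entries;
umd.threshold = 1;
umd.user_ptr  = kb;
umd.eq_handle = ktm->ktm_eqh;
umd.options   = LNET_MD_KIOV |
                (is_recv ? LNET_MD_OP_PUT : LNET_MD_OP_GET);

rc = LNetMDAttach(meh, umd, LNET_UNLINK, &kb->kb_mdh);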

LNet Active Bulk Read or Write

  1. Prior to invoking the nlx_core_buf_active_recv() or nlx_core_buf_active_send() subroutines, the transport should put the match bits of the remote passive buffer into the nlx_core_buffer::cb_match_bits field. The destination address of the remote transfer machine with the passive buffer should be set in the nlx_core_buffer::cb_addr field.
  2. Create an MD using LNetMDBind() with each invocation of the nlx_core_buf_active_recv() or nlx_core_buf_active_send() subroutines (a sketch follows this list). The MD is set up to unlink automatically. Save the MD handle in the nlx_kcore_buffer::kb_mdh field. Set up the fields of the lnet_md_t argument as follows:
    • Set the eq_handle to identify the EQ associated with the transfer machine (nlx_kcore_transfer_mc::ktm_eqh).
    • Set the address of the nlx_kcore_buffer in the user_ptr field.
    • Pass in the KIOV from the nlx_kcore_buffer::kb_kiov. The number of entries in the KIOV and the length field in the last element of the vector must be adjusted to reflect the desired byte count.
    • Set the LNET_MD_KIOV flag in the options field.
    • In case of an active read, which uses LNetGet(), set the threshold value to 2 to accommodate both the SEND and the REPLY events. Otherwise set it to 1.
  3. Use the LNetGet() subroutine to initiate the active read or the LNetPut() subroutine to initiate the active write. The hdr_data is set to 0 in the case of LNetPut(). No acknowledgment should be requested. In the case of an LNetGet(), the field used to track out-of-order REPLY events (nlx_kcore_buffer::kb_ooo_reply) should be cleared before the operation is initiated.
  4. When the send of the LNetGet() or LNetPut() request completes, an LNET_EVENT_SEND event will be delivered to the event queue; in the case of LNetGet() it is typically ignored. See LNet Event Callback Processing for details.
  5. When the bulk data transfer for LNetGet() completes, an LNET_EVENT_REPLY event will be delivered to the event queue, and will be processed as described in LNet Event Callback Processing.

    Note
    LNet does not guarantee the order of the SEND and REPLY events associated with the LNetGet() operation. Also note that in the case of an LNetGet() operation, the SEND event does not indicate if the recipient was able to save the data, but merely that the request left the host.
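
An active read via LNetGet() might be sketched as follows; an active write differs only in using LNetPut() with hdr_data set to 0 and a threshold of 1. The naming caveats of the earlier sketches apply.

kb->kb_ooo_reply = false;               /* clear out-of-order REPLY state */

umd.start     = kb->kb_kiov;
umd.length    = kiov_entries;
umd.threshold = 2;                      /* expect both SEND and REPLY */
umd.options   = LNET_MD_KIOV;
umd.user_ptr  = kb;
umd.eq_handle = ktm->ktm_eqh;

rc = LNetMDBind(umd, LNET_UNLINK, &kb->kb_mdh);

rc = LNetGet(self_nid, kb->kb_mdh,
             target,                    /* peer TM holding the passive buffer */
             dest_portal,
             cb->cb_match_bits,         /* match bits of the passive buffer */
             0);                        /* offset */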

LNet Canceling Operations

The kernel Core module provides no timeout capability. The transport may initiate a cancel operation using the nlx_core_buf_del() subroutine.

This will result in an LNetMDUnlink() subroutine call being issued for the buffer MD saved in the nlx_kcore_buffer::kb_mdh field. Cancellation may or may not take place: it depends upon whether the operation has started, and there is a race between making this call and the concurrent delivery of an event associated with the MD.
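
In outline, the cancel path reduces to unlinking the saved MD handle; LNetMDUnlink() is the Lustre 2.x call and kb names the buffer's nlx_kcore_buffer:

/* Attempt cancellation; LNet may already have completed, or be in the
 * process of completing, the operation, in which case this has no effect. */
rc = LNetMDUnlink(kb->kb_mdh);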

Assuming success, the next event delivered for the buffer concerned will either be a LNET_EVENT_UNLINK event or the unlinked field will be set in the next completion event for the buffer. The events will be processed as described in LNet Event Callback Processing.

LNet properly handles the race condition between the automatic unlink of the MD and a call to LNetMDUnlink().

State Specification

  • The kernel Core module relies on the networking data structures to maintain the linkage between the data structures used by the Core module, and maintains no lists of its own. As such, these lists can only be navigated by the Core API subroutines invoked by the transport (the "upper" layer) and not by the Core module's LNet callback subroutine (the "lower" layer).
  • The kernel Core API maintains a count of the total number of buffer event structures needed. This should be tested by the Core API's transfer machine invariant subroutine before returning from any buffer operation initiation call, and before returning from the nlx_core_buf_event_get() subroutine.
  • The kernel Core layer module depends on the LNet module in the kernel at run time. This dependency is captured by the Linux kernel module support that reference counts the usage of dependent modules.
  • The kernel Core layer module explicitly tracks the events received for LNetGet() calls, in the nlx_kcore_buffer data structure associated with the call. This is because two events (SEND and REPLY) are returned for this operation, LNet does not guarantee their order of arrival, and the event processing logic is set up so that a circular buffer event is created only upon receipt of the last operation event. Complicating the issue is that a cancellation response could be piggy-backed onto an in-order SEND. See LNet Event Callback Processing and LNet Active Bulk Read or Write for details.

Threading and Concurrency Model

  1. Generally speaking, API calls within the transport address space are protected by the serialization of the Motr Networking layer, typically the transfer machine mutex or the domain mutex. The nlx_core_buf_desc_encode() subroutine, for example, is fully protected by the transfer machine mutex held across the m0_net_buffer_add() subroutine call, and so implicitly protects the match bit counter in the kernel Core's per-TM private data.
  2. The Motr Networking layer serialization does not always suffice, as the kernel Core module has to support concurrent multiple transport instances in kernel and user space. Fortunately, the LNet API intrinsically provides considerable serialization support to the Core, as transfer machines are defined by the HLD to have disjoint addresses.
  3. Enforcement of the disjoint address semantics are protected by the kernel Core's nlx_kcore_mutex lock. The nlx_core_tm_start() and nlx_core_tm_stop() subroutines use this mutex internally for serialization and operation on the nlx_kcore_tms list threaded through the kernel Core's per-TM private data.
  4. The kernel Core module registers a callback subroutine with the LNet EQ defined per transfer machine. LNet requires that this subroutine be reentrant and non-blocking. The circular buffer event queue accessed from the callback requires a single producer, so the nlx_kcore_transfer_mc::ktm_bevq_lock spin lock is used to serialize its use across possible concurrent invocations. The time spent in the lock is minimal.
  5. The Core API does not support callbacks to indicate completion of an asynchronous buffer operation. Instead, the transport application must invoke the nlx_core_buf_event_wait() subroutine to block waiting for buffer events. Internally this call waits on the nlx_kcore_transfer_mc::ktm_sem semaphore. The semaphore is incremented each time an event is added to the buffer event queue.
  6. The event payload is actually delivered via a per transfer machine single producer, single consumer, lock-free circular buffer event queue. The only requirement for failure free operation is to ensure that there are sufficient event structures pre-allocated to the queue, plus one more to support the circular semantics. Multiple events may be dequeued between each call to the nlx_core_buf_event_wait() subroutine. Each such event is fetched by a call to the nlx_core_buf_event_get() subroutine, until the queue is exhausted. Note that the queue exists in memory shared between the transport and the kernel Core; the transport could be in the kernel or in user space.
  7. The API assumes that only a single transport thread will handle event processing. This is a critical assumption in the support for multiple messages in a single receive buffer, as it implicitly serializes the delivery of the events associated with any given receive buffer, thus the last event which unlinks the buffer is guaranteed to be delivered last.
  8. The Motr LNet transport driver releases all kernel resources associated with a user space domain when the device is released (the final close). It must not release buffer event objects or transfer machines while the LNet EQ callback requires them. The Kernel Core LNet EQ callback, nlx_kcore_eq_cb(), resets the association between a buffer and a transfer machine and increments the nlx_kcore_transfer_mc::ktm_sem semaphore while holding the nlx_kcore_transfer_mc::ktm_bevq_lock, and the callback never refers to either object after releasing the lock. The driver layer holds this lock as well while verifying that a buffer is not associated with a transfer machine, and, outside the lock, decrements the semaphore to wait for buffers to be unlinked by LNet (the device is being released, so no other thread will be decrementing the semaphore). This assures the buffer event objects and the transfer machine will remain until the final LNet event is delivered.
  9. LNet properly handles the race condition between the automatic unlink of the MD and a call to LNetMDUnlink().

NUMA optimizations

The LNet transport will initiate calls to the API on threads that may have specific processor affinity assigned.

LNet offers no direct NUMA optimizations. In particular, event callbacks cannot be constrained to have any specific processor affinity. The API compensates for this lack of support by providing a level of indirection in event delivery: its callback handler simply copies the LNet event payload to an event delivery queue and notifies a transport event processing thread of the presence of the event. (See The Buffer Event Queue above). The transport event processing threads can be constrained to have any desired processor affinity.


Conformance

  • r.m0.net.lnet.buffer-registration See LNet Buffer Registration.
  • r.m0.net.xprt.lnet.end-point-address The mapping of end point addresses to LNet addresses underlies Match Bits for Buffer Identification and Transfer Machine Uniqueness.
  • r.m0.net.xprt.lnet.multiple-messages-in-buffer See The Buffer Event Queue and LNet Receiving Unsolicited Messages.
  • r.m0.net.xprt.lnet.dynamic-address-assignment See Transfer Machine Uniqueness.
  • r.m0.net.xprt.lnet.user-space See Support for User Space Transports.
  • r.m0.net.xprt.lnet.user.no-gpl The Core API does not expose any LNet symbols (see Design Highlights).

Unit Tests

The testing strategy is two-pronged:

  • Tests with a fake LNet API. These tests intercept the LNet subroutine calls; the real LNet data structures are used by the Core API.
  • Tests with the real LNet API using the TCP loopback address. LNet on the test machine must be configured with the "tcp" network.

  • Test that the correct sequence of LNet operations is issued for each type of buffer operation (fake LNet API).
  • Test that the callback subroutine properly delivers events to the buffer event queue, including single and multiple events for receive buffers (fake LNet API).
  • Test the dynamic assignment of transfer machine identifiers (fake LNet API).
  • Test the parsing of LNet addresses (real LNet API).
  • Test each type of buffer operation, including single and multiple events for receive buffers (real LNet API).

System Tests

System testing will be performed as part of the transport operation system test.


Analysis

  • Dynamic transfer machine identifier assignment takes time proportional to the number of transfer machines defined on the server, including kernel and all user space LNet transport instances.
  • An LNet event callback is processed in constant time.
  • The time taken for the transport to dequeue a pending buffer event depends upon the operating system scheduler; the algorithmic processing involved is constant time.
  • Buffer registration takes constant time. The reference count of the buffer pages is not incremented, so there are no VM subsystem imposed delays.
  • The time taken to process outbound buffer operations is unpredictable, and depends, at the minimum, on current system load, other LNet users, and the network load.

References

  • HLD of Motr LNet Transport: for documentation links, refer to doc/motr-design-doc-list.rst.
  • The LNet API.