Motr M0
LNet Transport Device DLD

Overview

The Motr LNet Transport device provides user space access to the kernel Motr LNet Transport. The User Space Core implementation uses the device to communicate with the Kernel Core. The device provides a conduit through which information flows between the user space and kernel core layers, initiated by the user space layer. The specific operations that can be performed on the device are documented here. Hooks for unit testing the device are also discussed.


Definitions

  • HLD of Motr LNet Transport : For documentation links, please refer to this file : doc/motr-design-doc-list.rst
  • reference A reference to an object is stored in terms of a memory page and offset, rather than as a simple address pointer.
  • pin Keep a page of user memory from being paged out of physical memory, and cause it to be paged in if it was previously paged out. Pinned user pages, like pages of kernel memory, are tracked by kernel page objects. Pinning a page does not assign it a kernel logical address; that requires subsequently mapping the page. A pinned page remains pinned until it is explicitly unpinned. Pinning may involve the use of shared, reference counted objects, but one should not depend on this for correctness.
  • map Assign a kernel logical address to a page of memory. A mapped page remains mapped until explicitly unmapped. Both kernel and pinned user pages can be mapped. Mapping may involve the use of shared, reference counted objects and addresses, but one should not depend on this for correctness. Each time a page is mapped, it may be assigned a different logical address.
  • unmap Remove the association of a kernel logical address from a page. After a page is unmapped, it has no logical address until it is explicitly remapped.
  • unpin Allow a previously pinned page to move freely, i.e. an unpinned page can be swapped out of physical memory. Any struct page pointers to the previously pinned page are no longer valid after a page is unpinned.

Requirements

  • r.m0.net.xprt.lnet.user-space The implementation must accommodate the needs of the user space LNet transport.
  • r.m0.net.xprt.lnet.dev.pin-objects The implementation must pin shared objects in kernel memory to ensure they will not disappear while in use.
  • r.m0.net.xprt.lnet.dev.resource-tracking The implementation must track all shared resources and ensure they are released properly, even after a user space error.
  • r.m0.net.xprt.lnet.dev.safe-sharing The implementation must ensure that references to shared objects are valid.
  • r.m0.net.xprt.lnet.dev.assert-free The implementation must ensure that the kernel module will not assert due to invalid shared state.
  • r.m0.net.xprt.lnet.dev.minimal-mapping The implementation must avoid mapping many kernel pages for long periods of time, thereby avoiding excessive use of the kernel high-memory page map.

Dependencies

  • LNet Transport Core Interface
    Several modifications are required on the Core interface itself:
    • A new nlx_core_buffer_event::cbe_kpvt pointer is required that can be set to refer to the new nlx_kcore_buffer_event object.
    • The nlx_core_tm_start() function is changed to remove the cepa and epp parameters. The cepa parameter is always the same as the lctm->ctm_addr. The Core API does not use a m0_net_end_point, so setting the epp at the core layer was inappropriate. The LNet XO layer, which does use the end point, is modified to allocate this end point in the nlx_tm_ev_worker() logic itself.
    • The user space transport must ensure that shared objects do not cross page boundaries. This applies only to shared core objects such as nlx_core_transfer_mc, not to buffer pages. Since object allocation is actually done in the LNet XO layer (except for the nlx_core_buffer_event), an allocation wrapper function, nlx_core_mem_alloc(), must be added to the Core Interface and implemented separately in the kernel and user transports: the kernel transport has no such limitation, and the m0_alloc_aligned() API, which could otherwise be used to satisfy this requirement, requires page alignment or greater in kernel space. A corresponding nlx_core_mem_free() is required to free the allocated memory.
  • LNet Transport Core User Private Interface
    Besides the existence of this interface, the following dependencies exist:
  • LNet Transport Core Kernel Private Interface
    Several modifications are required in this interface:
    • The kernel core objects with back pointers to the corresponding core objects must be changed to remove these pointers and replace them with use of nlx_core_kmem_loc objects. More details of this dependency are discussed in Shared Memory Management Strategy.
    • The bev_cqueue_pnext() and bev_cqueue_put() are modified such that they map and unmap the nlx_core_buffer_event object (atomic mapping must be used, because these functions are called from the LNet callback). This also requires use of nlx_core_kmem_loc references in the nlx_core_bev_link.
    • Many of the Core APIs implemented in the kernel must be refactored such that the portion that can be shared between the kernel-only transport and the kernel driver is moved into a new API. The Kernel Core API implementation is changed to perform pre-checks, call the new shared API and complete post-shared operations. The User Space Core tasks described in the User Space Core Logical Specification can be used as a guide for how to refactor the Kernel Core implementation. In addition, the operations on the nlx_kcore_ops structure guide the signatures of the refactored, shared operations.
    • The nlx_kcore_umd_init() function is changed to set the MD user_ptr to the nlx_kcore_buffer, not the nlx_core_buffer. kb_buffer_id and kb_qtype fields are added to the nlx_kcore_buffer and are set during nlx_kcore_buf_register() and nlx_kcore_umd_init() respectively. This allows the lnet event callback to execute without using the nlx_core_buffer. All other uses of the MD user_ptr field must be changed accordingly.
    • Various blocks of M0_PRE() assertions used to validate shared objects before they are referenced should be refactored into new invariant-style functions so the driver can perform the checks and return an error without causing a kernel assertion.
    • The kernel core must implement the new nlx_core_mem_alloc() and nlx_core_mem_free() required by the Core Interface.
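The page-boundary requirement above can be met in a user-space nlx_core_mem_alloc() by rounding the allocation's alignment up to a power of two no smaller than its size: a block aligned to a power-of-two boundary at least as large as itself cannot straddle a page boundary. The following is a minimal sketch of this idea; the rounding strategy and the internals are assumptions, not the actual Motr implementation.

```c
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Round up to the next power of two; an allocation aligned to a
 * power-of-two boundary >= its size cannot cross a page boundary,
 * provided the size does not exceed PAGE_SIZE. */
static size_t next_pow2(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Hypothetical user-space implementation of nlx_core_mem_alloc(). */
void *nlx_core_mem_alloc(size_t size)
{
	size_t align = next_pow2(size);
	void  *p;

	if (size == 0 || align > PAGE_SIZE)
		return NULL;	/* shared core objects must fit in one page */
	if (align < sizeof(void *))
		align = sizeof(void *);	/* posix_memalign() minimum */
	return posix_memalign(&p, align, size) == 0 ? p : NULL;
}

void nlx_core_mem_free(void *data)
{
	free(data);
}
```

The kernel implementation needs no such care and can simply delegate to its normal allocator.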

Design Highlights

  • The device provides ioctl-based access to the Kernel LNet Core Interface.
  • Ioctl requests correspond roughly to the LNet Transport Core APIs.
  • Each user space m0_net_domain corresponds to opening a separate file descriptor.
  • The device driver tracks all resources associated with the file descriptor.
  • Well-defined patterns are used for sharing new resources between user and kernel space, referencing previously shared resources, and releasing shared resources.
  • The device driver can clean up a domain's resources in the case that the user program terminates prematurely.

Logical Specification

Component Overview

The LNet Device Driver is a layer between the user space transport core and the kernel space transport core. The driver layer provides a mechanism for the user space to interact with the Lustre LNet kernel module. It uses a subset of the kernel space transport core interface to implement this interaction.

  • HLD of Motr LNet Transport : For documentation links, please refer to this file : doc/motr-design-doc-list.rst

Refer specifically to the Design Highlights component diagram.

For reference, the relationship between the various components of the LNet transport and the networking layer is illustrated in the following UML diagram.

lnet_xo.png
LNet Transport Objects

The LNet Device Driver has no sub-components. It has several internal functions that interact with the kernel space transport core layer.

Device Setup and Shutdown

The LNet device is registered with the kernel using the nlx_dev_init() function when the Motr Kernel module is loaded. This function is called by the existing nlx_core_init() function. The function performs the following tasks.

  • It registers the device with the kernel. The device is registered as a miscellaneous device named "m0lnet". As such, registration causes the device to appear as "/dev/m0lnet" in the device file system.
  • It sets a flag, nlx_dev_registered, denoting successful device registration.

The LNet device is deregistered with the kernel using the nlx_dev_fini() function when the Motr Kernel module is unloaded. This function is called by the existing nlx_core_fini() function. The function performs the following task.

  • If device registration was performed successfully, deregisters the device and resets the nlx_dev_registered flag.

Ioctl Request Behavior

The user space implementation of the LNet Transport Core Interface interacts with the LNet Transport Kernel Core via ioctl requests. The file descriptor required to make the ioctl requests is obtained during the Domain Initialization operation.

All further interaction with the device, until the file descriptor is closed, is via ioctl requests. Ioctl requests are served by the nlx_dev_ioctl() function, an implementation of the kernel file_operations::unlocked_ioctl() function. This function performs the following steps.

  • It validates the request.
  • It copies in (from user space) the parameter object corresponding to the specific request for most _IOW requests. Note that the requests that take pointers instead of parameter objects do not copy in, because the pointers are either references to kernel objects or shared objects to be pinned, not copied.
  • It calls a helper function to execute the command; specific helper functions are called out in the following sections. The helper function calls a kernel core operation to execute the behavior shared between the user space and kernel transports. It does this indirectly through the operations defined on the nlx_kcore_domain::kd_drv_ops operation object.
  • It copies out (to user space) the parameter object corresponding to the specific request for _IOR and _IOWR requests.
  • It returns the status, generally that of the helper function. This status follows the typical convention of 0 for success and -errno for failure, except as specified for certain helper functions.

The helper functions verify that the requested operation will not cause an assertion in the kernel core. This is done by performing the same checks the Kernel Core APIs would perform, but without asserting. Instead, they log an ADDB record and return an error status when verification fails. The user space core can detect the error status and raise the assertion in the user space process instead. The error code -EBADR is used to report verification failure. The error code -EFAULT is used to report invalid user addresses, such as those used for pinning pages or copying in user data. Specific helper functions may return additional well-defined errors.
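The verify-then-execute pattern above can be sketched as follows. The structure layout, the magic field and its value are hypothetical stand-ins; only the discipline of returning -EBADR from an invariant-style check, rather than asserting, follows the design.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative shared object; the ctm_magic field and its value are
 * assumptions standing in for the real validity checks. */
struct nlx_core_transfer_mc {
	unsigned ctm_magic;
};

enum { NLX_CORE_TM_MAGIC = 0x6c6e6574 };	/* assumed value */

/* Invariant-style function: reports validity instead of asserting,
 * so the driver can fail an ioctl request gracefully. */
static bool nlx_kcore_tm_invariant(const struct nlx_core_transfer_mc *ctm)
{
	return ctm != NULL && ctm->ctm_magic == NLX_CORE_TM_MAGIC;
}

/* Pattern used by the ioctl helper functions: verify first, then
 * perform the shared kernel core operation. */
static int nlx_dev_ioctl_tm_stop(struct nlx_core_transfer_mc *ctm)
{
	if (!nlx_kcore_tm_invariant(ctm))
		return -EBADR;	/* verification failure, no kernel assert */
	/* ... the shared kernel core stop operation would run here ... */
	return 0;
}
```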

See also
LNet Transport Device Internals

Shared Memory Management Strategy

Some ioctl requests have the side effect of pinning user pages in memory. However, the mapping of pages (i.e. kmap() or kmap_atomic() functions) is performed only while the pages are to be used, and then unmapped as soon as possible. The number of mappings available to kmap() is documented as being limited. Except as noted, kmap_atomic() is used in blocks of code that will not sleep to map the page associated with an object. Each time a shared object is mapped, its invariants are re-checked to ensure the page still contains the shared object. Each shared core object is required to fit within a single page to simplify mapping and sharing. The user space transport must ensure this requirement is met when it allocates core objects. Note that the pages of the m0_bufvec segments are not part of the shared nlx_core_buffer; they are referenced by the associated nlx_kcore_buffer object and are never mapped by the driver or kernel core layers.

The nlx_core_kmem_loc structure stores the page and offset of an object. It also stores a checksum to detect inadvertent corruption of the address or offset. This structure is used in place of pointers within structures used in kernel address space to reference shared (pinned) user space objects. The kernel core structures nlx_kcore_domain, nlx_kcore_transfer_mc, nlx_kcore_buffer and nlx_kcore_buffer_event refer to shared objects, and use fields such as nlx_kcore_domain::kd_cd_loc to store these references. Structures such as nlx_core_bev_link and nlx_core_bev_cqueue, while themselves contained in shared objects, also use nlx_core_kmem_loc, because these structures in turn need to reference yet other shared objects. When the shared object is needed, it is mapped (e.g. kmap_atomic() returns a pointer to the mapped page, and the code adds the corresponding offset to obtain a pointer to the object itself), used, and unmapped. The kernel pointer to the shared object is used only on the stack, never stored in a shared place. This allows for unsynchronized, concurrent access to shared objects, just as if they were always mapped.

The core data structures include kernel private pointers, such as nlx_core_transfer_mc::ctm_kpvt. These opaque (to the user space) values are used as parameters to ioctl requests. These pointers cannot be used directly, since it is possible they could be inadvertently corrupted. To address that, when such pointers are passed to ioctl requests, they are first validated using virt_addr_valid() to ensure they can be dereferenced in the kernel and then further validated using the appropriate invariant, nlx_kcore_tm_invariant() in the case above. If either validation fails, an error is returned, as discussed in Ioctl Request Behavior.
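The checksum discipline of nlx_core_kmem_loc can be illustrated with a user-space sketch. In the kernel, using such a reference means mapping the stored page (kmap_atomic()), adding the offset, using the object, and unmapping; here only the reference layout and corruption check are shown. The field names, layout and checksum function are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative layout of a page/offset reference guarded by a
 * checksum; field names and the checksum function are assumptions. */
struct nlx_core_kmem_loc {
	void    *kl_page;	/* struct page * in the kernel */
	uint32_t kl_offset;	/* offset of the object in the page */
	uint32_t kl_checksum;	/* guards against corruption */
};

static uint32_t kmem_loc_checksum(const struct nlx_core_kmem_loc *loc)
{
	uint64_t v = (uint64_t)(uintptr_t)loc->kl_page;

	/* mix the page pointer and offset; any stable hash would do */
	return (uint32_t)(v ^ (v >> 32)) ^ loc->kl_offset;
}

static void kmem_loc_set(struct nlx_core_kmem_loc *loc,
			 void *page, uint32_t offset)
{
	loc->kl_page     = page;
	loc->kl_offset   = offset;
	loc->kl_checksum = kmem_loc_checksum(loc);
}

static bool nlx_core_kmem_loc_is_empty(const struct nlx_core_kmem_loc *loc)
{
	return loc->kl_page == NULL;
}

/* Detects inadvertent corruption of the page pointer or offset. */
static bool nlx_core_kmem_loc_invariant(const struct nlx_core_kmem_loc *loc)
{
	return !nlx_core_kmem_loc_is_empty(loc) &&
	       loc->kl_checksum == kmem_loc_checksum(loc);
}
```

Because the reference never changes after the object is pinned, it can be read concurrently without locks, as noted in the Threading and Concurrency Model.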

Domain Initialization

The LNet Transport Device Driver is first accessed during domain initialization. The user space core opens the device and performs an initial M0_LNET_DOM_INIT ioctl request.

In the kernel, the open() and ioctl() system calls are handled by the nlx_dev_open() and nlx_dev_ioctl() subroutines, respectively.

The nlx_dev_open() performs the following sequence.

The nlx_dev_ioctl() is described generally above. It uses the helper function nlx_dev_ioctl_dom_init() to complete kernel domain initialization. The following tasks are performed.

Domain Finalization

During normal domain finalization, the user space core closes its file descriptor after the upper layers have already cleaned up other resources (buffers and transfer machines). It is also possible that the user space process closes the file descriptor without first finalizing the associated domain resources, such as in the case that the user space process fails.

In the kernel, the close() system call is handled by the nlx_dev_close() subroutine. Technically, nlx_dev_close() is called once by the kernel when the last reference to the file is closed (e.g. if the file descriptor had been duplicated). The subroutine performs the following sequence.

Buffer Registration and Deregistration

While registering a buffer, the user space core performs a M0_LNET_BUF_REGISTER ioctl request.

The nlx_dev_ioctl() subroutine uses the helper function nlx_dev_ioctl_buf_register() to complete kernel buffer registration. The following tasks are performed.

While deregistering a buffer, the user space core performs a M0_LNET_BUF_DEREGISTER ioctl request.

The nlx_dev_ioctl() subroutine uses the helper function nlx_dev_ioctl_buf_deregister() to complete kernel buffer deregistration. The following tasks are performed.

Managing the Buffer Event Queue

The nlx_core_new_blessed_bev() helper allocates and blesses buffer event objects. In user space, blessing the object requires interacting with the kernel by way of the M0_LNET_BEV_BLESS ioctl request.

The nlx_dev_ioctl() subroutine uses the helper function nlx_dev_ioctl_bev_bless() to complete blessing the buffer event object. The following tasks are performed.

Buffer event objects are never removed from the buffer event queue until the transfer machine is stopped.
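The single-producer/single-consumer discipline that lets the buffer event queue operate without locks can be sketched as a simple ring. This is illustrative only: the real bev_cqueue is a circular linked list of nlx_core_bev_link objects referenced through nlx_core_kmem_loc, and the kernel producer and user consumer additionally need memory barriers, which are omitted here.

```c
#include <stdbool.h>
#include <stddef.h>

#define CQ_SLOTS 8

/* Simplified stand-in for the bev_cqueue: the kernel side (LNet event
 * callback) produces events, the user space side consumes them. Only
 * the producer writes cq_prod and only the consumer writes cq_cons,
 * so no lock is needed for a single producer and single consumer. */
struct cqueue {
	int      cq_events[CQ_SLOTS];
	unsigned cq_prod;	/* next slot the producer fills */
	unsigned cq_cons;	/* next slot the consumer reads */
};

static bool cq_put(struct cqueue *q, int ev)	/* producer side */
{
	if (q->cq_prod - q->cq_cons == CQ_SLOTS)
		return false;			/* queue full */
	q->cq_events[q->cq_prod % CQ_SLOTS] = ev;
	q->cq_prod++;				/* publish the event */
	return true;
}

static bool cq_get(struct cqueue *q, int *ev)	/* consumer side */
{
	if (q->cq_cons == q->cq_prod)
		return false;			/* queue empty */
	*ev = q->cq_events[q->cq_cons % CQ_SLOTS];
	q->cq_cons++;				/* release the slot */
	return true;
}
```

In the real queue, slots are buffer event objects that remain pinned for the life of the transfer machine, consistent with the statement above that they are never removed until the transfer machine is stopped.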

See also
Stopping a Transfer Machine

Starting a Transfer Machine

While starting a transfer machine, the user space core performs a M0_LNET_TM_START ioctl request.

The nlx_dev_ioctl() subroutine uses the helper function nlx_dev_ioctl_tm_start() to complete starting the transfer machine. The following tasks are performed.

Stopping a Transfer Machine

While stopping a transfer machine, the user space core performs a M0_LNET_TM_STOP ioctl request.

The nlx_dev_ioctl() subroutine uses the helper function nlx_dev_ioctl_tm_stop() to complete stopping the transfer machine. The following tasks are performed.

Transfer Machine Buffer Queue Operations

Several LNet core interfaces operate on buffers and transfer machine queues. In all user transport cases, the shared objects, nlx_core_buffer and nlx_core_transfer_mc, must have been previously shared with the kernel, through use of the M0_LNET_BUF_REGISTER and M0_LNET_TM_START ioctl requests, respectively.

The ioctl requests available to the user space core for managing buffers and transfer machine buffer queues are as follows.

The ioctl requests are handled by the following helper functions, respectively.

  • nlx_dev_ioctl_buf_msg_recv()
  • nlx_dev_ioctl_buf_msg_send()
  • nlx_dev_ioctl_buf_active_recv()
  • nlx_dev_ioctl_buf_active_send()
  • nlx_dev_ioctl_buf_passive_recv()
  • nlx_dev_ioctl_buf_passive_send()
  • nlx_dev_ioctl_buf_del()

These helper functions each perform similar tasks.

Waiting for Buffer Events

To wait for buffer events, the user space core performs a M0_LNET_BUF_EVENT_WAIT ioctl request.

The nlx_dev_ioctl() subroutine uses the helper function nlx_dev_ioctl_buf_event_wait() to perform the wait operation. The following tasks are performed.

Node Identifier Support

The user space core uses the M0_LNET_NIDSTR_DECODE and M0_LNET_NIDSTR_ENCODE requests to decode and encode NID strings, respectively.

The nlx_dev_ioctl() subroutine uses the helper function nlx_dev_ioctl_nidstr_decode() to decode the string. The following tasks are performed.

  • The parameter is validated to ensure no assertions will occur.
  • The libcfs_str2nid() function is called to convert the string to a NID.
  • In the case the result is LNET_NID_ANY, -EINVAL is returned, otherwise the dn_nid field is set.

The nlx_dev_ioctl() subroutine uses the helper function nlx_dev_ioctl_nidstr_encode() to encode the NID. The following tasks are performed.

  • The parameter is validated to ensure no assertions will occur.
  • The libcfs_nid2str() function is called to convert the NID to a string.
  • The resulting string is copied to the dn_buf field.

The user space core uses the M0_LNET_NIDSTRS_GET request to obtain the list of NID strings for the local LNet interfaces.

The nlx_dev_ioctl() subroutine uses the helper function nlx_dev_ioctl_nidstrs_get() to obtain the NID strings. The following tasks are performed.

  • The parameters are validated to ensure no assertions will occur.
  • The nlx_core_nidstrs_get() API is called to get the list of NID strings.
  • The buffer size required to store the strings is computed (sum of the string lengths of the NID strings, plus trailing nuls, plus one).
  • A temporary buffer of the required size is allocated.
  • The NID strings are copied consecutively into the buffer. Each NID string is nul terminated and an extra nul is written after the final NID string.
  • The contents of the buffer is copied to the user space buffer.
  • The nlx_core_nidstrs_put() API is called to release the list of NID strings.
  • The temporary buffer is freed.
  • The number of NID strings is returned on success; nlx_dev_ioctl() returns this positive number instead of the typical 0 for success.
  • The value -EFBIG is returned if the buffer is not big enough.
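The size computation and packing steps above can be sketched in user-space C. The helper names are hypothetical; only the layout (consecutive nul-terminated strings followed by one extra nul) follows the description.

```c
#include <stdlib.h>
#include <string.h>

/* Size needed to pack the NID strings: sum of string lengths, plus
 * one nul per string, plus a final extra nul. */
static size_t nidstrs_bufsize(const char **nids, int n)
{
	size_t sz = 1;	/* final extra nul */
	int    i;

	for (i = 0; i < n; ++i)
		sz += strlen(nids[i]) + 1;	/* string + its nul */
	return sz;
}

/* Copy the NID strings consecutively into a freshly allocated buffer,
 * each nul terminated, with an extra nul after the final string. */
static char *nidstrs_pack(const char **nids, int n, size_t *szp)
{
	size_t sz = nidstrs_bufsize(nids, n);
	char  *buf = malloc(sz);
	char  *p = buf;
	int    i;

	if (buf == NULL)
		return NULL;
	for (i = 0; i < n; ++i) {
		size_t len = strlen(nids[i]) + 1;

		memcpy(p, nids[i], len);
		p += len;
	}
	*p = '\0';	/* terminating empty string */
	*szp = sz;
	return buf;
}
```

The driver compares this computed size with the size of the user-supplied buffer and returns -EFBIG when the latter is too small.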

State Specification

The LNet device driver does not introduce its own state model but operates within the frameworks defined by the Motr Networking Module and the Kernel device driver interface. In general, resources are pinned and allocated when an object is first shared with the kernel by the user space process, and are freed and unpinned when the user space requests their release. To ensure there is no resource leakage, any remaining resources are freed when the nlx_dev_close() API is called.

The resources managed by the driver are tracked by the following lists:

Each nlx_kcore_domain object has two valid states, which can be determined by inspecting the nlx_kcore_domain::kd_cd_loc field:

  • nlx_core_kmem_loc_is_empty(&kd_cd_loc): The device is newly opened and the M0_LNET_DOM_INIT ioctl request has not yet been performed.
  • nlx_core_kmem_loc_invariant(&kd_cd_loc): The M0_LNET_DOM_INIT ioctl request has been performed, associating it with a nlx_core_domain object. In this state, the nlx_kcore_domain is ready for use and remains in this state until finalized.

Threading and Concurrency Model

The LNet device driver has no threads of its own. It operates within the context of a user space process and a kernel thread operating on behalf of that process. All operations are invoked through the Linux device driver interface, specifically the operations defined on the nlx_dev_file_ops object. The nlx_dev_open() and nlx_dev_close() are guaranteed to be called once each for each kernel file object, and calls to these operations are guaranteed to not overlap with calls to the nlx_dev_ioctl() operation. However, multiple calls to nlx_dev_ioctl() may occur simultaneously on different threads.

Synchronization of device driver resources is controlled by a single mutex per domain, the nlx_kcore_domain::kd_drv_mutex. This mutex must be held while manipulating the resource lists, nlx_kcore_domain::kd_drv_tms, nlx_kcore_domain::kd_drv_bufs and nlx_kcore_transfer_mc::ktm_drv_bevs.

The mutex may also be used to serialize driver ioctl requests, such as in the case of M0_LNET_DOM_INIT.

The driver mutex must be obtained before any other Net or Kernel Core mutex.

Mapping of nlx_core_kmem_loc object references can be performed without synchronization, because the nlx_core_kmem_loc never changes after an object is pinned, and the mapped pointer is specified to never be stored in a shared location, i.e. only on the stack. The functions that unpin shared objects have invariants and pre-conditions to ensure that the objects are no longer in use and can be unpinned without causing a mapping failure.

Cleanup of kernel resources for user domains synchronizes with the Kernel Core LNet EQ callback by use of the nlx_kcore_transfer_mc::ktm_bevq_lock and the nlx_kcore_transfer_mc::ktm_sem, as discussed in Threading and Concurrency Model.

NUMA optimizations

The LNet device driver does not allocate threads. The user space application can control processor affinity by confining the threads it uses to access the device driver.


Conformance


Unit Tests

LNet Device driver unit tests focus on covering the common code paths. Code paths involving most Kernel LNet Core operations and the device wrappers will be handled as part of testing the user transport. Even so, some tests are most easily performed by coordinating user space code with kernel unit tests. The following strategy will be used:

  • When the LNet unit test suite is initialized in the kernel, it creates a /proc/m0_lnet_ut file, registering read and write file operations.
  • The kernel UT waits (e.g. on a condition variable with a timeout) for the user space program to synchronize. It may time out and fail the UT if the user space program does not synchronize quickly enough, e.g. after a few seconds.
  • A user space program is started concurrently with the kernel unit tests.
  • The user space program waits for the /proc/m0_lnet_ut to appear.
  • The user space program writes a message to the /proc/m0_lnet_ut to synchronize with the kernel unit test.
  • The write system call operation registered for /proc/m0_lnet_ut signals the condition variable that the kernel UT is waiting on.
  • The user space program loops.
    • The user space program reads the /proc/m0_lnet_ut for instructions.
    • Each instruction tells the user space program which test to perform; there is a special instruction to tell the user space program the unit test is complete.
    • The user space program writes the test result back.
  • When the LNet unit test suite is finalized in the kernel, the /proc/m0_lnet_ut file is removed.

While ioctl requests on the /dev/m0lnet device could be used for such coordination, this would result in unit test code being mixed into the production code. The use of a /proc file for coordinating unit tests ensures this is not the case.

To enable unit testing of the device layer without requiring full kernel core behavior, the device layer accesses kernel core operations indirectly via the nlx_kcore_domain::kd_drv_ops operation structure. During unit tests, these operations can be changed to call mock operations instead of the real kernel core operations. This allows testing of things such as pinning and mapping pages without causing real core behavior to occur.

Test:
Initializing the device causes it to be registered and visible in the file system.
Test:
The device can be opened and closed.
Test:
Reading or writing the device fails.
Test:
Unsupported ioctl requests fail.
Test:
A nlx_core_domain can be initialized and finalized, testing common code paths and the strategy of pinning and unpinning pages.
Test:
A nlx_core_domain is initialized; several nlx_core_transfer_mc objects are started and then stopped; the domain is finalized and the device is closed. No cleanup is necessary.
Test:
A nlx_core_domain is initialized and the same nlx_core_transfer_mc object is started twice; the error is detected. The remaining transfer machine is stopped. The device is closed. No cleanup is necessary.
Test:
A nlx_core_domain is initialized and several nlx_core_transfer_mc objects are started; then the device is closed, and cleanup occurs.

Buffer and buffer event management tests, and more advanced domain and transfer machine tests, will be added as part of testing the user space transport.


System Tests

System testing will be performed as part of the transport operation system test.


Analysis

  • The algorithmic complexity of ioctl requests is constant, except
    • the complexity of pinning a buffer varies with the number of pages in the buffer,
    • the complexity of stopping a transfer machine is proportional to the number of buffer events pinned.
  • The time to pin or kmap() a page is unpredictable and depends, at a minimum, on current system load, memory consumption and other LNet users. For this reason, kmap_atomic() should be used when a shared page can be used without blocking.
  • The driver layer consumes a small amount of additional memory in the form of additional fields in the various kernel core objects.
  • The use of stack pointers instead of pointers within kernel core objects while mapping shared objects avoids the need to synchronize the use of pointers within the kernel core objects themselves.

References