Motr
The Motr LNet Transport device provides user space access to the kernel Motr LNet Transport. The User Space Core implementation uses the device to communicate with the Kernel Core. The device provides a conduit through which information flows between the user space and kernel core layers, initiated by the user space layer. The specific operations that can be performed on the device are documented here. Hooks for unit testing the device are also discussed.
page objects. Pinning a page does not assign it a kernel logical address; that requires subsequently mapping the page. A pinned page remains pinned until it is explicitly unpinned. Pinning may involve the use of shared, reference counted objects, but one should not depend on this for correctness.
page pointers to the previously pinned page are no longer valid after a page is unpinned.
nlx_core_buffer_event::cbe_kpvt pointer is required that can be set to refer to the new nlx_kcore_buffer_event object.
nlx_core_tm_start() function is changed to remove the cepa and epp parameters. The cepa parameter is always the same as lctm->ctm_addr. The Core API does not use a m0_net_end_point, so setting the epp at the core layer was inappropriate. The LNet XO layer, which does use the end point, is modified to allocate this end point in the nlx_tm_ev_worker() logic itself.
nlx_core_transfer_mc, not to buffer pages. Since object allocation is actually done in the LNet XO layer (except for the nlx_core_buffer_event), this requires that an allocation wrapper function, nlx_core_mem_alloc(), be added to the Core Interface, implemented separately in the kernel and user transports, because the kernel transport has no such limitation and the m0_alloc_aligned() API, which could be used to satisfy this requirement, requires page aligned data or greater in kernel space. A corresponding nlx_core_mem_free() is required to free the allocated memory.
nlx_core_mem_alloc() and nlx_core_mem_free() required by the Core Interface.
nlx_core_kmem_loc objects. More details of this dependency are discussed in Shared Memory Management Strategy.
bev_cqueue_pnext() and bev_cqueue_put() are modified such that they map and unmap the nlx_core_buffer_event object (atomic mapping must be used, because these functions are called from the LNet callback). This also requires use of nlx_core_kmem_loc references in the nlx_core_bev_link.
nlx_kcore_ops structure guide the signatures of the refactored, shared operations.
nlx_kcore_umd_init() function is changed to set the MD user_ptr to the nlx_kcore_buffer, not the nlx_core_buffer. kb_buffer_id and kb_qtype fields are added to the nlx_kcore_buffer and are set during nlx_kcore_buf_register() and nlx_kcore_umd_init() respectively. This allows the LNet event callback to execute without using the nlx_core_buffer. All other uses of the MD user_ptr field must be changed accordingly.
M0_PRE() assertions used to validate shared objects before they are referenced should be refactored into new invariant-style functions so the driver can perform the checks and return an error without causing a kernel assertion.
nlx_core_mem_alloc() and nlx_core_mem_free() required by the Core Interface.
m0_net_domain corresponds to opening a separate file descriptor.
The LNet Device Driver is a layer between the user space transport core and the kernel space transport core. The driver layer provides a mechanism for the user space to interact with the Lustre LNet kernel module. It uses a subset of the kernel space transport core interface to implement this interaction.
Refer specifically to the Design Highlights component diagram.
For reference, the relationship between the various components of the LNet transport and the networking layer is illustrated in the following UML diagram.
The LNet Device Driver has no sub-components. It has several internal functions that interact with the kernel space transport core layer.
The LNet device is registered with the kernel using the nlx_dev_init() function when the Motr Kernel module is loaded. This function is called by the existing nlx_core_init() function. The function performs the following tasks.
nlx_dev_registered, denoting successful device registration.
The LNet device is deregistered with the kernel using the nlx_dev_fini() function when the Motr Kernel module is unloaded. This function is called by the existing nlx_core_fini() function. The function performs the following task.
nlx_dev_registered flag.
The user space implementation of the LNet Transport Core Interface interacts with the LNet Transport Kernel Core via ioctl requests. The file descriptor required to make the ioctl requests is obtained during the Domain Initialization operation.
All further interaction with the device, until the file descriptor is closed, is via ioctl requests. Ioctl requests are served by the nlx_dev_ioctl() function, an implementation of the kernel file_operations::unlocked_ioctl() function. This function performs the following steps.
nlx_kcore_domain::kd_drv_ops operation object.
The helper functions verify that the requested operation will not cause an assertion in the kernel core. This is done by performing the same checks the Kernel Core APIs would do, but without asserting. Instead, they log an ADDB record and return an error status when verification fails. The user space core can detect the error status and assert in the user space process. The error code -EBADR is used to report verification failure. The error code -EFAULT is used to report invalid user addresses, such as for use in pinning pages or copying in user data. Specific helper functions may return additional well-defined errors.
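The error convention described above can be illustrated with a small user space sketch. It is not the driver code: `sketch_core_obj`, `sketch_obj_invariant()` and `sketch_ioctl_helper()` are hypothetical names, the magic value is invented, and a NULL check stands in for the failure of a real user address (which in the kernel would surface when copying in data or pinning pages).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EBADR
#define EBADR 53 /* Linux value; defined here only for non-Linux builds */
#endif

/* Hypothetical stand-in for a shared core object; not the real layout. */
struct sketch_core_obj {
	uint64_t magic; /* written by the owner when the object is valid */
};

#define SKETCH_OBJ_MAGIC 0x4c4e6574u /* invented value for this sketch */

typedef int (*sketch_core_op_t)(struct sketch_core_obj *obj);

/* Invariant-style check: reports the problem instead of asserting. */
static int sketch_obj_invariant(const struct sketch_core_obj *obj)
{
	return obj->magic == SKETCH_OBJ_MAGIC;
}

/*
 * Helper in the style described above: an unusable user address yields
 * -EFAULT, a failed verification yields -EBADR, and only a verified
 * object is passed on to the core operation.
 */
static int sketch_ioctl_helper(struct sketch_core_obj *user_obj,
			       sketch_core_op_t core_op)
{
	/* core_op comes from the driver itself, so NULL is a coding error. */
	assert(core_op != NULL);
	if (user_obj == NULL) /* stand-in for a bad user address */
		return -EFAULT;
	if (!sketch_obj_invariant(user_obj))
		return -EBADR;
	return core_op(user_obj);
}

/* Trivial core operation used to exercise the helper. */
static int sketch_core_op_ok(struct sketch_core_obj *obj)
{
	(void)obj;
	return 0;
}
```

The point of the split is that data supplied by user space is always answered with an error code, while assertions are reserved for contracts internal to the driver.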
Some ioctl requests have the side effect of pinning user pages in memory. However, the mapping of pages (i.e. kmap() or kmap_atomic() functions) is performed only while the pages are to be used, and then unmapped as soon as possible. The number of mappings available to kmap() is documented as being limited. Except as noted, kmap_atomic() is used in blocks of code that will not sleep to map the page associated with an object. Each time a shared object is mapped, its invariants are re-checked to ensure the page still contains the shared object. Each shared core object is required to fit within a single page to simplify mapping and sharing. The user space transport must ensure this requirement is met when it allocates core objects. Note that the pages of the m0_bufvec segments are not part of the shared nlx_core_buffer; they are referenced by the associated nlx_kcore_buffer object and are never mapped by the driver or kernel core layers.
The nlx_core_kmem_loc structure stores the page and offset of an object. It also stores a checksum to detect inadvertent corruption of the address or offset. This structure is used in place of pointers within structures used in kernel address space to reference shared (pinned) user space objects. Kernel core structures nlx_kcore_domain, nlx_kcore_transfer_mc, nlx_kcore_buffer and nlx_kcore_buffer_event refer to shared objects, and use fields such as nlx_kcore_domain::kd_cd_loc to store these references. Structures such as nlx_core_bev_link and nlx_core_bev_cqueue, while contained in shared objects, also use nlx_core_kmem_loc, because these structures in turn need to reference yet other shared objects. When the shared object is needed, it is mapped (e.g. kmap_atomic() returns a pointer to the mapped page, and the code adds the corresponding offset to obtain a pointer to the object itself), used, and unmapped. The kernel pointer to the shared object is only used on the stack, never stored in a shared place. This allows for unsynchronized, concurrent access to shared objects, just as if they were always mapped.
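The page, offset and checksum scheme can be sketched in user space as follows. This is a simplified illustration under stated assumptions, not the real nlx_core_kmem_loc: the `sketch_` types and functions are hypothetical, the checksum formula is invented, a byte array stands in for a pinned page, and pointer arithmetic stands in for kmap_atomic().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u

/* Stand-in for a pinned page; the real driver refers to a struct page. */
struct sketch_page {
	unsigned char data[SKETCH_PAGE_SIZE];
};

/* Hypothetical simplification of a page + offset + checksum reference. */
struct sketch_kmem_loc {
	struct sketch_page *page;
	uint32_t            offset;
	uint32_t            checksum;
};

/* Invented checksum over the page address and offset. */
static uint32_t sketch_loc_checksum(const struct sketch_kmem_loc *loc)
{
	uint64_t p = (uint64_t)(uintptr_t)loc->page;

	return (uint32_t)(p ^ (p >> 32)) ^ loc->offset ^ 0x5a5a5a5au;
}

static void sketch_loc_set(struct sketch_kmem_loc *loc,
			   struct sketch_page *pg, uint32_t off)
{
	assert(loc != NULL && pg != NULL);
	loc->page = pg;
	loc->offset = off;
	loc->checksum = sketch_loc_checksum(loc);
}

/* An all-zero reference means "nothing shared yet". */
static bool sketch_loc_is_empty(const struct sketch_kmem_loc *loc)
{
	return loc->page == NULL && loc->offset == 0 && loc->checksum == 0;
}

/* Valid only if uncorrupted and the object fits inside the one page. */
static bool sketch_loc_invariant(const struct sketch_kmem_loc *loc,
				 size_t obj_size)
{
	return loc->page != NULL &&
	       loc->checksum == sketch_loc_checksum(loc) &&
	       obj_size <= SKETCH_PAGE_SIZE &&
	       loc->offset <= SKETCH_PAGE_SIZE - obj_size;
}

/* Map, use, unmap: the mapped pointer lives only on this stack frame. */
static int sketch_bump_counter(const struct sketch_kmem_loc *loc)
{
	unsigned char *obj;
	uint32_t       counter;

	if (!sketch_loc_invariant(loc, sizeof counter))
		return -1;
	obj = loc->page->data + loc->offset; /* "map" plus offset */
	memcpy(&counter, obj, sizeof counter);
	++counter;
	memcpy(obj, &counter, sizeof counter);
	return 0;                            /* "unmap" would happen here */
}
```

Because the reference itself never changes after it is set, any number of callers can run `sketch_bump_counter()`-style code concurrently without touching shared pointer state; only the contents of the shared object need their own protection.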
The core data structures include kernel private pointers, such as nlx_core_transfer_mc::ctm_kpvt. These opaque (to the user space) values are used as parameters to ioctl requests. These pointers cannot be used directly, since it is possible they could be inadvertently corrupted. To address that, when such pointers are passed to ioctl requests, they are first validated using virt_addr_valid() to ensure they can be dereferenced in the kernel and then further validated using the appropriate invariant, nlx_kcore_tm_invariant() in the case above. If either validation fails, an error is returned, as discussed in Ioctl Request Behavior.
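The two-stage validation of an opaque private pointer can be sketched as below. This is a user space analogy only: `sketch_addr_valid()` is a hypothetical stand-in for virt_addr_valid() that checks membership in a small static pool, the magic comparison stands in for the real invariant, and the choice of -EBADR for both failures is an assumption based on the error convention in Ioctl Request Behavior.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EBADR
#define EBADR 53 /* Linux value; defined here only for non-Linux builds */
#endif

#define SKETCH_KTM_MAGIC 0x6b746du /* invented value for this sketch */
#define SKETCH_POOL_NR   4

/* Hypothetical kernel-private object referenced by an opaque pointer. */
struct sketch_ktm {
	uint64_t magic;
	int      id;
};

/* All private objects in this sketch live in one static pool. */
static struct sketch_ktm sketch_pool[SKETCH_POOL_NR];

/* Stage 1: is the address one that may be dereferenced at all? */
static bool sketch_addr_valid(const void *p)
{
	uintptr_t a  = (uintptr_t)p;
	uintptr_t lo = (uintptr_t)&sketch_pool[0];
	uintptr_t hi = (uintptr_t)&sketch_pool[SKETCH_POOL_NR];

	return a >= lo && a < hi && (a - lo) % sizeof(sketch_pool[0]) == 0;
}

/* Stage 2: does the object at that address satisfy its invariant? */
static bool sketch_ktm_invariant(const struct sketch_ktm *ktm)
{
	return ktm->magic == SKETCH_KTM_MAGIC;
}

/* Convert an opaque value received from user space into a usable pointer. */
static int sketch_ktm_lookup(void *kpvt, struct sketch_ktm **out)
{
	struct sketch_ktm *ktm;

	assert(out != NULL);
	if (!sketch_addr_valid(kpvt))
		return -EBADR;
	ktm = kpvt;
	if (!sketch_ktm_invariant(ktm))
		return -EBADR;
	*out = ktm;
	return 0;
}
```

The order matters: the invariant reads through the pointer, so it may only run after the address check has succeeded.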
The LNet Transport Device Driver is first accessed during domain initialization. The user space core opens the device and performs an initial M0_LNET_DOM_INIT ioctl request.
In the kernel, the open() and ioctl() system calls are handled by the nlx_dev_open() and nlx_dev_ioctl() subroutines, respectively.
The nlx_dev_open() subroutine performs the following sequence.
nlx_kcore_domain object, initializes it using nlx_kcore_kcore_dom_init() and assigns the object to the file->private_data field.
The nlx_dev_ioctl() subroutine is described generally above. It uses the helper function nlx_dev_ioctl_dom_init() to complete kernel domain initialization. The following tasks are performed.
nlx_kcore_domain::kd_drv_mutex is locked.
nlx_kcore_domain is verified to ensure the core domain is not already initialized.
nlx_core_domain object is pinned in kernel memory.
nlx_core_domain is saved in the nlx_kcore_domain::kd_cd_loc.
nlx_core_domain is mapped and validated to ensure no assertions will occur.
nlx_core_domain is initialized by nlx_kcore_ops::ko_dom_init().
m0_lnet_dev_dom_init_params object.
nlx_core_domain is unmapped (it remains pinned).
nlx_kcore_domain::kd_drv_mutex is unlocked.
During normal domain finalization, the user space core closes its file descriptor after the upper layers have already cleaned up other resources (buffers and transfer machines). It is also possible that the user space process closes the file descriptor without first finalizing the associated domain resources, such as in the case that the user space process fails.
In the kernel, the close() system call is handled by the nlx_dev_close() subroutine. Technically, nlx_dev_close() is called once by the kernel when the last reference to the file is closed (e.g. if the file descriptor had been duplicated). The subroutine performs the following sequence.
nlx_kcore_domain::kd_drv_tms and nlx_kcore_domain::kd_drv_bufs are empty.
nlx_core_transfer_mc objects must be unpinned.
nlx_core_buffer_event objects must be unpinned.
nlx_kcore_buffer_event object is freed.
nlx_kcore_transfer_mc object is freed.
nlx_core_buffer objects must be unpinned.
nlx_kcore_buffer object is freed.
nlx_kcore_ops::ko_dom_fini() to finalize the core domain.
nlx_core_domain object, resetting the nlx_kcore_domain::kd_cd_loc.
file->private_data.
nlx_kcore_kcore_dom_fini() to finalize the nlx_kcore_domain object.
nlx_kcore_domain object.
While registering a buffer, the user space core performs a M0_LNET_BUF_REGISTER ioctl request.
The nlx_dev_ioctl() subroutine uses the helper function nlx_dev_ioctl_buf_register() to complete kernel buffer registration. The following tasks are performed.
m0_bufvec::ov_buf and m0_bufvec::ov_vec::v_count are copied in, temporarily (to avoid issues of either list crossing page boundaries that might occur by mapping the pages directly), and the corresponding fields of the m0_lnet_dev_buf_register_params::dbr_bvec are updated to refer to the copies.
nlx_core_buffer, m0_lnet_dev_buf_register_params::dbr_lcbuf, is pinned in kernel memory.
nlx_core_buffer is saved in the nlx_kcore_buffer::kb_cb_loc.
nlx_core_buffer is mapped and validated to ensure no assertions will occur. It is also checked to ensure it is not already associated with a nlx_kcore_buffer object.
nlx_kcore_ops::ko_buf_register() is used to initialize the nlx_core_buffer and nlx_kcore_buffer objects.
nlx_kcore_buffer_uva_to_kiov() is used to pin the pages of the buffer segments and initialize the nlx_kcore_buffer::kb_kiov.
nlx_core_buffer is unmapped (it remains pinned).
nlx_kcore_buffer is added to the nlx_kcore_domain::kd_drv_bufs list.
m0_lnet_dev_buf_register_params::dbr_bvec are freed.
While deregistering a buffer, the user space core performs a M0_LNET_BUF_DEREGISTER ioctl request.
The nlx_dev_ioctl() subroutine uses the helper function nlx_dev_ioctl_buf_deregister() to complete kernel buffer deregistration. The following tasks are performed.
nlx_kcore_buffer::kb_kiov, are unpinned.
nlx_kcore_domain::kd_drv_bufs list.
nlx_core_buffer is mapped.
nlx_kcore_ops::ko_buf_deregister() is used to deregister the buffer.
nlx_kcore_buffer object is freed.
nlx_core_buffer is unmapped and unpinned.
The nlx_core_new_blessed_bev() helper allocates and blesses buffer event objects. In user space, blessing the object requires interacting with the kernel by way of the M0_LNET_BEV_BLESS ioctl request.
The nlx_dev_ioctl() subroutine uses the helper function nlx_dev_ioctl_bev_bless() to complete blessing the buffer event object. The following tasks are performed.
nlx_core_buffer_event is pinned in kernel memory.
nlx_kcore_buffer_event object is allocated and initialized.
nlx_kcore_buffer_event object.
nlx_core_buffer_event is mapped, validated to ensure no assertions will occur, and checked to ensure it is not already associated with a nlx_kcore_buffer_event object.
bev_link_bless() function is called to bless the object.
nlx_core_buffer_event is unmapped (it remains pinned).
nlx_kcore_buffer_event object is added to the nlx_kcore_transfer_mc::ktm_drv_bevs list.
Buffer event objects are never removed from the buffer event queue until the transfer machine is stopped.
While starting a transfer machine, the user space core performs a M0_LNET_TM_START ioctl request.
The nlx_dev_ioctl() subroutine uses the helper function nlx_dev_ioctl_tm_start() to complete starting the transfer machine. The following tasks are performed.
nlx_core_transfer_mc object is pinned in kernel memory.
nlx_kcore_transfer_mc is allocated.
nlx_kcore_transfer_mc object.
nlx_core_transfer_mc is mapped using kmap() because the core operation may sleep.
nlx_core_transfer_mc is checked to ensure it is not already associated with a nlx_kcore_transfer_mc object and that it will not cause assertions.
nlx_kcore_ops::ko_tm_start() is used to complete the kernel TM start.
nlx_core_transfer_mc is unmapped (it remains pinned).
nlx_kcore_transfer_mc is added to the nlx_kcore_domain::kd_drv_tms list.
While stopping a transfer machine, the user space core performs a M0_LNET_TM_STOP ioctl request.
The nlx_dev_ioctl() subroutine uses the helper function nlx_dev_ioctl_tm_stop() to complete stopping the transfer machine. The following tasks are performed.
nlx_kcore_domain::kd_drv_tms list.
nlx_kcore_transfer_mc::ktm_drv_bevs list are unpinned and their corresponding nlx_kcore_buffer_event objects freed.
nlx_core_transfer_mc is mapped.
nlx_kcore_ops::ko_tm_stop() is used to stop the transfer machine.
nlx_core_transfer_mc is unmapped and unpinned.
Several LNet core interfaces operate on buffers and transfer machine queues. In all user transport cases, the shared objects, nlx_core_buffer and nlx_core_transfer_mc, must have been previously shared with the kernel, through use of the M0_LNET_BUF_REGISTER and M0_LNET_TM_START ioctl requests, respectively.
The ioctl requests available to the user space core for managing buffers and transfer machine buffer queues are as follows.
M0_LNET_BUF_MSG_RECV
M0_LNET_BUF_MSG_SEND
M0_LNET_BUF_ACTIVE_RECV
M0_LNET_BUF_ACTIVE_SEND
M0_LNET_BUF_PASSIVE_RECV
M0_LNET_BUF_PASSIVE_SEND
M0_LNET_BUF_DEL
The ioctl requests are handled by the following helper functions, respectively.
nlx_dev_ioctl_buf_msg_recv()
nlx_dev_ioctl_buf_msg_send()
nlx_dev_ioctl_buf_active_recv()
nlx_dev_ioctl_buf_active_send()
nlx_dev_ioctl_buf_passive_recv()
nlx_dev_ioctl_buf_passive_send()
nlx_dev_ioctl_buf_del()
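The one-to-one pairing of requests and helper functions listed above lends itself to a table-driven dispatch. The sketch below shows that idea in user space; it is an assumption about structure rather than a description of nlx_dev_ioctl(), the `SKETCH_` request codes are plain enumerators rather than real ioctl numbers, and returning -ENOTTY for an unknown request follows the general Linux convention, not anything stated in this document.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical request codes mirroring the list above, in the same order. */
enum sketch_req {
	SKETCH_BUF_MSG_RECV,
	SKETCH_BUF_MSG_SEND,
	SKETCH_BUF_ACTIVE_RECV,
	SKETCH_BUF_ACTIVE_SEND,
	SKETCH_BUF_PASSIVE_RECV,
	SKETCH_BUF_PASSIVE_SEND,
	SKETCH_BUF_DEL,
	SKETCH_REQ_NR
};

/* Hypothetical parameter block; records which helper handled it. */
struct sketch_params {
	int handled_req;
};

typedef int (*sketch_helper_t)(struct sketch_params *p);

/* Each generated helper only records its own request code. */
#define SKETCH_HELPER(name, code)                 \
	static int name(struct sketch_params *p)  \
	{                                         \
		p->handled_req = (code);          \
		return 0;                         \
	}

SKETCH_HELPER(sketch_buf_msg_recv,     SKETCH_BUF_MSG_RECV)
SKETCH_HELPER(sketch_buf_msg_send,     SKETCH_BUF_MSG_SEND)
SKETCH_HELPER(sketch_buf_active_recv,  SKETCH_BUF_ACTIVE_RECV)
SKETCH_HELPER(sketch_buf_active_send,  SKETCH_BUF_ACTIVE_SEND)
SKETCH_HELPER(sketch_buf_passive_recv, SKETCH_BUF_PASSIVE_RECV)
SKETCH_HELPER(sketch_buf_passive_send, SKETCH_BUF_PASSIVE_SEND)
SKETCH_HELPER(sketch_buf_del,          SKETCH_BUF_DEL)

/* Request code to helper, one entry per request. */
static const sketch_helper_t sketch_helpers[SKETCH_REQ_NR] = {
	[SKETCH_BUF_MSG_RECV]     = sketch_buf_msg_recv,
	[SKETCH_BUF_MSG_SEND]     = sketch_buf_msg_send,
	[SKETCH_BUF_ACTIVE_RECV]  = sketch_buf_active_recv,
	[SKETCH_BUF_ACTIVE_SEND]  = sketch_buf_active_send,
	[SKETCH_BUF_PASSIVE_RECV] = sketch_buf_passive_recv,
	[SKETCH_BUF_PASSIVE_SEND] = sketch_buf_passive_send,
	[SKETCH_BUF_DEL]          = sketch_buf_del,
};

/* Route a request to its helper; unknown requests are rejected. */
static int sketch_dispatch(unsigned int req, struct sketch_params *p)
{
	assert(p != NULL);
	if (req >= SKETCH_REQ_NR)
		return -ENOTTY;
	return sketch_helpers[req](p);
}
```

A switch statement would serve equally well; the table simply keeps the request-to-helper pairing visible in one place.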
These helper functions each perform similar tasks.
nlx_core_transfer_mc and nlx_core_buffer are mapped using kmap() because the core operations may sleep.
nlx_core_transfer_mc and nlx_core_buffer are unmapped.
To wait for buffer events, the user space core performs a M0_LNET_BUF_EVENT_WAIT ioctl request.
The nlx_dev_ioctl() subroutine uses the helper function nlx_dev_ioctl_buf_event_wait() to perform the wait operation. The following tasks are performed.
nlx_kcore_ops::ko_buf_event_wait() function is called.
The user space core uses the M0_LNET_NIDSTR_DECODE and M0_LNET_NIDSTR_ENCODE requests to decode and encode NID strings, respectively.
The nlx_dev_ioctl() subroutine uses the helper function nlx_dev_ioctl_nidstr_decode() to decode the string. The following tasks are performed.
libcfs_str2nid() function is called to convert the string to a NID.
dn_nid field is set.
The nlx_dev_ioctl() subroutine uses the helper function nlx_dev_ioctl_nidstr_encode() to encode the NID. The following tasks are performed.
libcfs_nid2str() function is called to convert the NID to a string.
dn_buf field.
The user space core uses the M0_LNET_NIDSTRS_GET request to obtain the list of NID strings for the local LNet interfaces.
The nlx_dev_ioctl() subroutine uses the helper function nlx_dev_ioctl_nidstrs_get() to obtain the NID strings. The following tasks are performed.
nlx_core_nidstrs_get() API is called to get the list of NID strings.
nlx_core_nidstrs_put() API is called to release the list of NID strings.
nlx_dev_ioctl() returns this positive number instead of the typical 0 for success.
The LNet device driver does not introduce its own state model but operates within the frameworks defined by the Motr Networking Module and the Kernel device driver interface. In general, resources are pinned and allocated when an object is first shared with the kernel by the user space process and are freed and unpinned when the user space requests. To ensure there is no resource leakage, remaining resources are freed when the nlx_dev_close() API is called.
The resources managed by the driver are tracked by the following lists:
nlx_kcore_domain::kd_cd_loc (a single item)
nlx_kcore_domain::kd_drv_tms
nlx_kcore_domain::kd_drv_bufs
nlx_kcore_transfer_mc::ktm_drv_bevs
Each nlx_kcore_domain object has 2 valid states which can be determined by inspecting the nlx_kcore_domain::kd_cd_loc field:
nlx_core_kmem_loc_is_empty(&kd_cd_loc): The device is newly opened and the M0_LNET_DOM_INIT ioctl request has not yet been performed.
nlx_core_kmem_loc_invariant(&kd_cd_loc): The M0_LNET_DOM_INIT ioctl request has been performed, associating it with a nlx_core_domain object. In this state, the nlx_kcore_domain is ready for use and remains in this state until finalized.
The LNet device driver has no threads of its own. It operates within the context of a user space process and a kernel thread operating on behalf of that process. All operations are invoked through the Linux device driver interface, specifically the operations defined on the nlx_dev_file_ops object. The nlx_dev_open() and nlx_dev_close() operations are guaranteed to be called once each for each kernel file object, and calls to these operations are guaranteed to not overlap with calls to the nlx_dev_ioctl() operation. However, multiple calls to nlx_dev_ioctl() may occur simultaneously on different threads.
Synchronization of device driver resources is controlled by a single mutex per domain, the nlx_kcore_domain::kd_drv_mutex. This mutex must be held while manipulating the resource lists, nlx_kcore_domain::kd_drv_tms, nlx_kcore_domain::kd_drv_bufs and nlx_kcore_transfer_mc::ktm_drv_bevs.
The mutex may also be used to serialize driver ioctl requests, such as in the case of M0_LNET_DOM_INIT.
The driver mutex must be obtained before any other Net or Kernel Core mutex.
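The rule that the per-domain mutex is held while the resource lists are manipulated can be sketched with POSIX threads. This is a user space analogy under stated assumptions: `sketch_kdomain` and `sketch_tm_node` are hypothetical types, a singly linked list stands in for the driver's list type, and a pthread mutex stands in for the kernel mutex.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical tracked resource, e.g. one started transfer machine. */
struct sketch_tm_node {
	struct sketch_tm_node *next;
	int                    id;
};

/* Hypothetical per-domain state: one mutex guards the resource list. */
struct sketch_kdomain {
	pthread_mutex_t        drv_mutex;
	struct sketch_tm_node *drv_tms;
};

static void sketch_dom_init(struct sketch_kdomain *kd)
{
	assert(kd != NULL);
	pthread_mutex_init(&kd->drv_mutex, NULL);
	kd->drv_tms = NULL;
}

/* Allocate outside the lock, link the node in while holding it. */
static int sketch_tm_add(struct sketch_kdomain *kd, int id)
{
	struct sketch_tm_node *n = malloc(sizeof *n);

	if (n == NULL)
		return -ENOMEM;
	n->id = id;
	pthread_mutex_lock(&kd->drv_mutex);
	n->next = kd->drv_tms;
	kd->drv_tms = n;
	pthread_mutex_unlock(&kd->drv_mutex);
	return 0;
}

/* Unlink while holding the lock, free after releasing it. */
static int sketch_tm_del(struct sketch_kdomain *kd, int id)
{
	struct sketch_tm_node **pp;
	struct sketch_tm_node  *found = NULL;
	int                     rc;

	pthread_mutex_lock(&kd->drv_mutex);
	for (pp = &kd->drv_tms; *pp != NULL; pp = &(*pp)->next) {
		if ((*pp)->id == id) {
			found = *pp;
			*pp = found->next;
			break;
		}
	}
	pthread_mutex_unlock(&kd->drv_mutex);
	rc = found != NULL ? 0 : -ENOENT;
	free(found);
	return rc;
}

/* Count entries under the lock so the walk sees a stable list. */
static int sketch_tm_count(struct sketch_kdomain *kd)
{
	const struct sketch_tm_node *n;
	int                          nr = 0;

	pthread_mutex_lock(&kd->drv_mutex);
	for (n = kd->drv_tms; n != NULL; n = n->next)
		++nr;
	pthread_mutex_unlock(&kd->drv_mutex);
	return nr;
}
```

Keeping allocation and freeing outside the critical section keeps the lock hold time short, which matters when several ioctl calls arrive on different threads at once.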
Mapping of nlx_core_kmem_loc object references can be performed without synchronization, because the nlx_core_kmem_loc never changes after an object is pinned, and the mapped pointer is specified to never be stored in a shared location, i.e. only on the stack. The functions that unpin shared objects have invariants and pre-conditions to ensure that the objects are no longer in use and can be unpinned without causing a mapping failure.
Cleanup of kernel resources for user domains synchronizes with the Kernel Core LNet EQ callback by use of the nlx_kcore_transfer_mc::ktm_bevq_lock and the nlx_kcore_transfer_mc::ktm_sem, as discussed in Threading and Concurrency Model.
The LNet device driver does not allocate threads. The user space application can control thread processor affinity by confining the threads it uses to access the device driver.
kmap_atomic() when possible. Domain Finalization, Ioctl Request Behavior.
LNet Device driver unit tests focus on covering the common code paths. Code paths involving most Kernel LNet Core operations and the device wrappers will be handled as part of testing the user transport. Even so, some tests are most easily performed by coordinating user space code with kernel unit tests. The following strategy will be used:
While ioctl requests on the /dev/m0lnet device could be used for such coordination, this would result in unit test code being mixed into the production code. The use of a /proc file for coordinating unit tests ensures this is not the case.
To enable unit testing of the device layer without requiring full kernel core behavior, the device layer accesses kernel core operations indirectly via the nlx_kcore_domain::kd_drv_ops operation structure. During unit tests, these operations can be changed to call mock operations instead of the real kernel core operations. This allows testing of things such as pinning and mapping pages without causing real core behavior to occur.
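The indirection through an operations structure can be sketched as follows. The names are hypothetical (`sketch_ops`, `sketch_kdomain`, `sketch_dev_dom_init()`), only one operation is shown, and the call counters exist purely so the effect of swapping the table can be observed.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_kdomain;

/* Hypothetical operation table; the real one has many more entries. */
struct sketch_ops {
	int (*dom_init)(struct sketch_kdomain *kd);
};

/* Hypothetical domain holding a pointer to the table currently in use. */
struct sketch_kdomain {
	const struct sketch_ops *drv_ops;
	int                      real_calls;
	int                      mock_calls;
};

/* Stand-in for the real kernel core operation. */
static int sketch_real_dom_init(struct sketch_kdomain *kd)
{
	kd->real_calls++;
	return 0;
}

/* Mock used by unit tests: records the call, does no real work. */
static int sketch_mock_dom_init(struct sketch_kdomain *kd)
{
	kd->mock_calls++;
	return 0;
}

static const struct sketch_ops sketch_real_ops = {
	.dom_init = sketch_real_dom_init,
};

static const struct sketch_ops sketch_mock_ops = {
	.dom_init = sketch_mock_dom_init,
};

/* Device-layer code never names the core function directly. */
static int sketch_dev_dom_init(struct sketch_kdomain *kd)
{
	assert(kd != NULL && kd->drv_ops != NULL);
	assert(kd->drv_ops->dom_init != NULL);
	return kd->drv_ops->dom_init(kd);
}
```

A test would point `drv_ops` at `sketch_mock_ops` before exercising the device layer and restore `sketch_real_ops` afterwards, so the production path contains no test-only branches.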
Buffer and buffer event management tests and more advanced domain and transfer machine tests will be added as part of testing the user space transport.
System testing will be performed as part of the transport operation system test.
kmap() a page is unpredictable, and depends, at the minimum, on current system load, memory consumption and other LNet users. For this reason, kmap_atomic() should be used when a shared page can be used without blocking.