Motr  M0
The catalogue service (CAS)

Overview

Catalogue service exports BE btrees (be/btree.[ch]) to the network.


Definitions

Some of definitions are copied from HLD (see References).

  • Catalogue (index): a container for records. A catalogue is explicitly created and deleted by a user and has an identifier (m0_fid), assigned by the user.
  • Meta-catalogue: single catalogue containing information about all existing catalogues.
  • Catalogue-index: local (non-distributed) catalogue maintained by the index subsystem on each node in the pool. When a component catalogue is created for a distributed index, a record mapping the catalogue to the index is inserted in the catalogue-index. This record is used by the index repair and re-balance to find locations of other replicas.
  • Record: a key-value pair.
  • Key: an arbitrary sequence of bytes, used to identify a record in a catalogue.
  • Value: an arbitrary sequence of bytes, associated with a key.
  • Key order: total order, defined on keys within a given container. Iterating through the container, returns keys in this order. The order is defined as lexicographical order of keys, interpreted as bit-strings.
  • User: any Motr component or external application using a cas instance by sending fops to it.

Requirements

Additional requirements that are not covered in HLD (see References)

  • r.cas.bulk RPC bulk mechanism should be supported for transmission of CAS request that doesn't fit into one FOP.
  • r.cas.sync CAS service processes requests synchronously. If catalogue can't be locked immediately, then FOM is blocked.
  • r.cas.indices-list User should be able to request list of all indices in meta-index without prior knowledge of any index FID.
  • r.cas.addb ADDB statistics should be collected for:
    • Sizes for every requested key/value;
    • Read/write lock contention.

Dependencies

  • reqh service
  • fom, fom long lock
  • BE transaction, BE btree.

Design Highlights

  • Catalogue service implementation doesn't bother about distribution of key/value records between nodes in the cluster. Distributed key/value storage may be built on the client side.
  • Keys and values can be of arbitrary size, so they possibly don't fit into single FOP. Bulk RPC transmission should be used in this case (not implemented yet).
  • Per-FOM BE transaction is used to perform modifications of index B-trees.
  • B-tree for meta-index is stored in zero BE segment of motr global BE domain. Other B-trees are stored in BE segment of the request handler.

Logical Specification

State Specification

*                          +----------------------------+
*                          |                            |
*                          V    !cas_op_checked         |
*                      FOPH_INIT---------------+        |
*                          |                   |        |
*                          |                   V        |
*                          |             CAS_CHECK_PRE  |
*                          |                   |        |
*                          |                   V        |
*                          |               CAS_CHECK----+
*                          |                   |
*                          |                   | cas_op_check() failed
*                          V                   |
*                   [generic phases]        FAILURE
*                          .
*                          .                               meta_op
*                          .                       +---------------------+
*                          V                       |                     |
*                       TXN_INIT-------------->CAS_START<-------+        |
*                                                  |            |        |
*                                                  V            |        |
*                       TXN_OPEN<-----+      CAS_META_LOCK      |        |
*                          |          |            |            |        |
*                          |          |            V            |        |
*                   [generic phases]  |     CAS_META_LOOKUP     |        |
*                          .          |            |            |        |
*                          .          |            V            |        |
*           !meta_op       .          |   CAS_META_LOOKUP_DONE  |        |
*          +---------CAS_TXN_OPENED   |            |            |        |
*          |               |          |            |            |        |
*          V               |          |            |            |        |
*   CAS_META_UNLOCK------->|          |            |            |        |
*                          |          |            | ctg_crow   |        |
*                          |          |            +----------+ |        |
* +----------------------->|          |            |          | |        |
* |                        |          |            |          V |        |
* |      +-CAS_DTM0<-+     |          |            |   CAS_CTG_CROW_DONE |
* |      |           |     |          |            |                     |
* |      |           |dtm0 |          |            |                     |
* |      V           |     |          |            |                     |
* |   SUCCESS<-------+-CAS_LOOP<----+ |            |                     |
* |                        |        | |            V                     |
* |            index drop  |        | |    +->CAS_LOAD_KEY<--------------+
* |        +---------------+        | |    |       |
* |        |               |        | |    |       V
* |        V               |        | |    |  CAS_LOAD_VAL
* |  CAS_INSERT_TO_DEAD    |        | |    |       |
* |        |               |        | |    |       V
* |        V               |        | |    +--CAS_LOAD_DONE
* |  CAS_DELETE_FROM_META->|        | |            |
* |                        |        | |            |
* |        ctidx_op_needed |        | |            |
* |      +-----------------+        | |            |
* |      |                 |        | |            |
* |      V                 |        | |            |
* |  CAS_CTIDX------------>|        | |            |  index drop
* |                        |        | |            +-------------+
* |          index drop    |        | |            |             |
* |       +----------------+        | |            |             V
* |       |                |        | |            |<---CAS_DEAD_INDEX_LOCK
* |       V                |        | |            |
* +--CAS_IDROP_LOOP        |        | |            |
*         |                |        | |            |
*         V                |        | |            |
*  CAS_IDROP_LOCK_LOOP     |        | |            |
*   |       |              |        | |            |
*   |       V              |        | |            |
*   |   CAS_IDROP_LOCKED-->|        | |            |
*   |                      |        | |            |
*   V                      |        | |            |
* CAS_IDROP_START_GC       |        | |            |
*   |                      V        | |            |
*   V              CAS_PREPARE_SEND | |            |
* SUCCESS                  |        | |            |
*                          |next_key| |            |
*                          +------->| |            |
*                          |        | |        CAS_LOCK------------+
*                          V        | |            |               |
*                     CAS_SEND_KEY  | |            |meta_op        |
*                          |        | |            |               |
*                          V        | |            V               |
*                     CAS_SEND_VAL  | |      CAS_CTIDX_LOCK        |
*                          |        | |            |               |
*                          V        | |            V               |
*                       CAS_DONE----+ +--------CAS_PREP<-----------+
*
* 

Threading and Concurrency Model

Catalogues (including meta catalogue) are protected with "multiple readers/single writer" lock (see m0_cas_ctg::cc_lock). Writer starvation is not possible due to FOM long lock design, because writer has priority over readers.

B-tree structure has internal rwlock (m0_be_btree::bb_lock), but it's not convenient for usage inside FOM since it blocks the execution thread. Also, index should be locked before FOM BE TX credit calculation, because amount of credits depends on the height of BE tree. Index is unlocked when all necessary B-tree operations are done.

cas-lspec-layout

Memory for all indices (including meta-index) is allocated in the first segment (m0_be_domain_seg_first()) of Motr global BE domain (m0::i_be_dom). Address of a meta-index is stored in this segment dictionary. Meta-index btree stores information about existing indices (including itself) as key-value pairs, where key is a FID of an index and value is a pointer to it.


Conformance

  • i.cas.bulk Not designed yet.
  • i.cas.sync FOM is blocked on waiting of per-index FOM long lock.
  • i.cas.indices-list Meta-index includes itself and has the smallest FID (0,0), so it is always the first by key order. User can specify m0_cas_meta_fid as s starting FID to list indices from the start.
  • i.cas.addb Per-index long lock is initialised with non-NULL addb2 structure. Key/value size of each record in request is collected in cas_fom_addb2_descr() function.

References

  • HLD of the catalogue service : For documentation links, please refer to this file : doc/motr-design-doc-list.rst