Motr  M0
DLD

Recovery overview

BE allows users to modify segments in a transactional way. A segment is backed by a Linux stob, which does not provide atomic writes. BE segments must remain consistent even after a crash, so BE needs a component that can restore segment data after a crash. This component is called recovery.

Definitions

  • Recovery is the process of restoring the consistency of the segments that the current BE domain consists of.
  • Log is the persistent storage where transactions are written before they are placed into a segment.
  • Valid group is a transaction group that is fully logged and contains a complete commit block with proper magic and checksum.
  • Log-only group is a group that was completely logged, but for which it is not known whether it was placed.
  • Dirty bit is a flag that indicates whether BE was shut down correctly during the previous run.
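
The definition of a valid group can be illustrated with a minimal sketch in C. The structure layout, the magic value, the field names and the use of FNV-1a as a checksum are assumptions made for illustration only; the actual commit block format is not specified in this section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical magic value; the real commit block magic is not given here. */
#define COMMIT_BLOCK_MAGIC 0x434f4d4d49544221ULL

/* Hypothetical commit block that closes a logged group. */
struct commit_block {
	uint64_t cb_magic;    /* must equal COMMIT_BLOCK_MAGIC */
	uint64_t cb_lsn;      /* lsn of the group closed by this block */
	uint64_t cb_size;     /* size of the group payload in bytes */
	uint64_t cb_checksum; /* checksum of the group payload */
};

/* FNV-1a, used here only as a stand-in for the real checksum. */
static uint64_t payload_checksum(const unsigned char *buf, size_t len)
{
	uint64_t h = 0xcbf29ce484222325ULL;
	size_t   i;

	for (i = 0; i < len; ++i) {
		h ^= buf[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}

/* Fills the commit block for a payload that has been fully logged. */
static void commit_block_seal(struct commit_block *cb, uint64_t lsn,
			      const unsigned char *payload, size_t len)
{
	assert(cb != NULL && payload != NULL);
	cb->cb_magic    = COMMIT_BLOCK_MAGIC;
	cb->cb_lsn      = lsn;
	cb->cb_size     = len;
	cb->cb_checksum = payload_checksum(payload, len);
}

/* A group is valid if its commit block has proper magic, size and checksum. */
static bool group_is_valid(const struct commit_block *cb,
			   const unsigned char *payload, size_t len)
{
	assert(cb != NULL && payload != NULL);
	return cb->cb_magic == COMMIT_BLOCK_MAGIC &&
	       cb->cb_size == len &&
	       cb->cb_checksum == payload_checksum(payload, len);
}
```

In this sketch a group whose payload or commit block was only partially written fails the check, so recovery would not treat it as a valid group.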

Requirements

  • On successful completion of recovery, segments have to contain only complete transactions and no partial ones;
  • Recovery shouldn't take long after a normal shutdown;
  • Otherwise, recovery time is not important (yet);
  • Log is stored on persistent storage with random access.

Recovery should be started and succeed after:

  • Power loss;
  • m0d crash:
    • OOM m0d kill;
    • SIGSEGV or other signal.

Recovery should also succeed if any of the above failures happens during recovery itself.

Recovery may fail after:

  • Memory corruption;
  • Segment or log corruption;
  • I/O error in segment or log.

Dependencies

Recovery depends on these BE components:

  • log;
  • engine;
  • tx_group;
  • seg0.

Logical Specification

Component Overview

The following BE subsystems are involved in the recovery process, so they have to be updated as described in the list:

  • Engine starts the recovery process inside m0_be_domain_start(). The recovery process is started every time the engine starts. It analyses special bits inside seg0 and starts re-applying groups of transactions stored in the log to make the filesystem stored in BE segments consistent.
  • Log stores groups of transactions to provide durability of BE storage. For the recovery process, it has to provide an interface or functions for efficient iteration over the groups stored inside. The algorithm has to iterate over groups from the last placed one to the last logged one.
  • Seg0 is a storage of metadata related to the filesystem. Recovery needs to store a special bit (the dirty bit) inside seg0; by analysing this bit, recovery decides whether to start scanning the log and re-applying groups. This bit has to be set or cleared with a special (non-transactional) procedure, and it can be stored in struct m0_be_seg_hdr near the start of the segment. In the future, some redundancy can be added: the dirty bit can be stored in several places of the segment.
  • Recovery encapsulates the set of algorithms, data structures and formats that are needed to keep the data stored inside BE segments consistent in the presence of failures.
  • From the recovery point of view, PageD is a set of special interfaces that have to be used to apply changes to the in-memory representation of the segments. Using the m0_be_reg_{get,put}() interface, recovery loads the page that holds the data of the region described by the scanned group, and then applies the changes to that region.
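
The role of the dirty bit in the start-up decision can be sketched as follows. The structure and the function names below are hypothetical and do not reflect the real struct m0_be_seg_hdr or the engine API. The moments at which the bit is set and cleared (set on start, cleared on clean stop) are also an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the part of the segment header with the bit. */
struct seg_hdr {
	uint64_t sh_dirty; /* non-zero: the previous run was not shut down cleanly */
};

/*
 * Called on every start. Returns true if the log has to be scanned and
 * groups have to be re-applied. The bit is left set for the whole run
 * (assumption), so a crash at any later point, including a crash during
 * recovery itself, is detected on the next start.
 */
static bool domain_start(struct seg_hdr *hdr)
{
	bool need_recovery;

	assert(hdr != NULL);
	need_recovery = hdr->sh_dirty != 0;
	hdr->sh_dirty = 1; /* in real code: a non-transactional write */
	return need_recovery;
}

/* Called on normal shutdown, after all logged groups have been placed. */
static void domain_clean_stop(struct seg_hdr *hdr)
{
	assert(hdr != NULL);
	hdr->sh_dirty = 0; /* in real code: a non-transactional write */
}
```

With this scheme a start after a normal shutdown skips log scanning completely, which matches the requirement that recovery should not take long after a normal shutdown.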

Component Subroutines

Scanning algorithm that finds the last logged group and the last placed group

The log header contains a pointer to a valid logged group. The scanning algorithm begins from this group and scans groups in both the forward and the backward direction. Lsns of consecutive groups must increase, so forward scanning continues while the lsn increases, and backward scanning continues while the lsn decreases. The last group handled by forward scanning is the last logged group. It contains a pointer to the last placed group.

Backward scanning also stops when the lsn of the current group is less than the lsn of the last placed group.
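
The scan can be sketched in C under a few simplifying assumptions: the log is modelled as a ring of fixed-size slots, validity of a group is a precomputed flag, and the pointer to the last placed group is represented by its lsn. All names are hypothetical. The final loop, which skips groups older than the last placed one when the header points to such a group, is an addition of this sketch and is not described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one group (log record) in the log. */
struct log_rec {
	bool     lr_valid;           /* fully logged, commit block is correct */
	uint64_t lr_lsn;             /* lsn of this group */
	uint64_t lr_last_placed_lsn; /* lsn of the last placed group */
};

struct scan_result {
	size_t sr_first; /* slot of the oldest group to re-apply */
	size_t sr_last;  /* slot of the last logged group */
};

/* Scans a ring of nr slots starting from the slot the log header points to. */
static struct scan_result log_scan(const struct log_rec *log, size_t nr,
				   size_t start)
{
	struct scan_result res;
	uint64_t           placed;
	size_t             cur = start;
	size_t             next;

	assert(log != NULL && nr > 0 && start < nr && log[start].lr_valid);

	/* Forward: continue while the next group is valid and lsn increases. */
	for (;;) {
		next = (cur + 1) % nr;
		if (!log[next].lr_valid || log[next].lr_lsn <= log[cur].lr_lsn)
			break;
		cur = next;
	}
	res.sr_last = cur;
	placed = log[cur].lr_last_placed_lsn;

	/* Backward: continue while lsn decreases and is not below placed. */
	cur = start;
	for (;;) {
		next = (cur + nr - 1) % nr;
		if (!log[next].lr_valid ||
		    log[next].lr_lsn >= log[cur].lr_lsn ||
		    log[next].lr_lsn < placed)
			break;
		cur = next;
	}
	/* Sketch-only step: skip groups older than the last placed one. */
	while (cur != res.sr_last && log[cur].lr_lsn < placed)
		cur = (cur + 1) % nr;
	res.sr_first = cur;
	return res;
}
```

Because lsns strictly increase along the forward walk and strictly decrease along the backward walk, both loops terminate even when every slot of the ring holds a valid group.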

Note: the BE log operates on log records. A log record is the representation of a transaction group in the log.

Iterative interface for looking over groups that need to be re-applied

Recovery provides an interface for picking the next group to re-apply.
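
One possible shape of such an interface is sketched below. It assumes that the log is a ring of slots and that a previous scan has already found the slots of the first and the last group to re-apply. The type and function names are hypothetical and are not the actual recovery API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor of a logged group. */
struct group {
	uint64_t g_lsn;
};

/* Iterator over the groups that have to be re-applied, oldest first. */
struct recovery_iter {
	const struct group *ri_log;  /* ring of group slots */
	size_t              ri_nr;   /* number of slots in the ring */
	size_t              ri_pos;  /* slot returned by the next call */
	size_t              ri_left; /* number of groups not yet returned */
};

/* first and last are the slots found by the scanning algorithm. */
static void recovery_iter_init(struct recovery_iter *it,
			       const struct group *log, size_t nr,
			       size_t first, size_t last)
{
	assert(it != NULL && log != NULL);
	assert(nr > 0 && first < nr && last < nr);
	it->ri_log  = log;
	it->ri_nr   = nr;
	it->ri_pos  = first;
	/* Distance from first to last along the ring, both ends included. */
	it->ri_left = (last + nr - first) % nr + 1;
}

/* Returns the next group to re-apply, or NULL when there are no more. */
static const struct group *recovery_iter_next(struct recovery_iter *it)
{
	const struct group *g;

	assert(it != NULL);
	if (it->ri_left == 0)
		return NULL;
	g = &it->ri_log[it->ri_pos];
	it->ri_pos = (it->ri_pos + 1) % it->ri_nr;
	--it->ri_left;
	return g;
}
```

A caller would initialise the iterator once and then call recovery_iter_next() in a loop, re-applying each returned group until NULL is returned.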