Configuring multiple active MDS daemons¶
Also known as: multi-mds, active-active MDS
By default, each CephFS file system is configured with a single active MDS daemon. To scale metadata performance on large systems, you may enable multiple active MDS daemons, which will share the metadata workload with one another.
When should I use multiple active MDS daemons?¶
You should configure multiple active MDS daemons when your metadata performance is bottlenecked on the single MDS that runs by default.
Adding more daemons does not improve performance for all workloads; it depends on the workload type. Typically, a single application running on a single client will not benefit from additional MDS daemons unless the application is performing metadata operations in parallel.
Workloads that generally benefit from a larger number of active MDS daemons are those with many clients, ideally working on many separate directories.
Scaling the active MDS cluster¶
Each CephFS file system has its own max_mds setting, which controls how many ranks will be created. The actual number of ranks in the file system will only be increased if a spare daemon is available to take on a new rank: for example, if only one MDS daemon is running and max_mds is set to two, no second rank will be created.
Set max_mds to the desired number of ranks. In the following examples, the "fsmap" line of ceph status output illustrates the expected result of the command:
# fsmap e5: 1/1/1 up {0=a=up:active}, 2 up:standby
ceph fs set <fs_name> max_mds 2
# fsmap e8: 2/2/2 up {0=a=up:active,1=c=up:creating}, 1 up:standby
# fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby
The newly created rank (1) will pass through the creating state and then enter the active state.
Standby daemons¶
Even with multiple active MDS daemons, a highly available system still requires standby daemons to take over when the servers running active daemons fail.
Consequently, the practical maximum of max_mds for highly available systems is one less than the total number of MDS servers in your system.
To remain available in the event of multiple server failures, increase the number of standby daemons in the system to match the number of server failures you need to withstand.
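The sizing rule above is simple arithmetic, sketched here for clarity (this is an illustration, not a Ceph API; the function name is invented):

```python
def max_active_ranks(total_mds_servers: int, failures_to_tolerate: int = 1) -> int:
    """Illustrative arithmetic only: each server failure you want to
    survive requires one standby daemon held in reserve, so the number
    of active ranks (max_mds) is bounded by the total minus the reserve."""
    return total_mds_servers - failures_to_tolerate

# With 5 MDS servers and 2 tolerated server failures, run at most 3 active ranks:
print(max_active_ranks(5, 2))  # 3
```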
Decreasing the number of ranks¶
Reducing the number of ranks is as simple as reducing max_mds:
# fsmap e9: 2/2/2 up {0=a=up:active,1=c=up:active}, 1 up:standby
ceph fs set <fs_name> max_mds 1
# fsmap e10: 2/2/1 up {0=a=up:active,1=c=up:stopping}, 1 up:standby
...
# fsmap e10: 1/1/1 up {0=a=up:active}, 2 up:standby
The cluster will automatically stop extra ranks incrementally until max_mds is reached.
See CephFS administration commands for more details on which forms <role> can take.
Note: a stopped rank will first enter the stopping state for a period of time while it hands off its share of the metadata to the remaining active daemons. This phase can take from a few seconds to a few minutes. If the MDS appears to be stuck in the stopping state, that should be investigated as a possible bug.
If an MDS daemon crashes or is killed while in the up:stopping state, a standby will take over and the cluster monitors will again try to stop the daemon.
When a daemon finishes stopping, it will respawn itself and go back to being a standby.
Manually pinning directory trees to a particular rank¶
In multiple active MDS configurations, a balancer runs which works to spread metadata load evenly across the cluster. This usually works well enough for most users, but sometimes it is desirable to override the dynamic balancer and map particular portions of the metadata to particular ranks. Doing so allows an administrator or user to evenly spread application load, or to limit the impact of a user's metadata requests on the rest of the cluster.
The mechanism provided for this purpose is called the export pin, an extended attribute of directories named ceph.dir.pin. Users can set this attribute using standard commands:
setfattr -n ceph.dir.pin -v 2 path/to/dir
The value of the extended attribute is the rank to assign to the directory subtree. A default value of -1 indicates the directory is not pinned.
A directory's export pin is inherited from its closest parent that has an export pin set; likewise, setting the export pin on a directory affects all of its children. However, a parent's pin can be overridden by setting the child directory's export pin. For example:
mkdir -p a/b
# "a" and "a/b" both start without an export pin set
setfattr -n ceph.dir.pin -v 1 a/
# a and b are now pinned to rank 1
setfattr -n ceph.dir.pin -v 0 a/b
# a/b is now pinned to rank 0 and a/ and the rest of its children are still pinned to rank 1
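The closest-parent inheritance rule above can be modeled as a walk up the directory tree: a directory's effective pin is the pin of the closest ancestor (including itself) whose ceph.dir.pin is not -1. A minimal sketch of that rule (this models the semantics only, it is not the MDS implementation; the function name is invented):

```python
import posixpath

def effective_export_pin(path: str, pins: dict) -> int:
    """Walk from `path` toward the root and return the first explicit
    export pin found; -1 means the directory is unpinned.
    (Hypothetical model of the inheritance rule, for illustration.)"""
    while True:
        if pins.get(path, -1) != -1:
            return pins[path]
        parent = posixpath.dirname(path)
        if parent == path:  # reached the root without finding a pin
            return -1
        path = parent

# Mirrors the setfattr example above: "a" pinned to rank 1, "a/b" overridden to rank 0.
pins = {"a": 1, "a/b": 0}
print(effective_export_pin("a/c", pins))    # 1 (inherited from a)
print(effective_export_pin("a/b/d", pins))  # 0 (overridden at a/b)
```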
Setting subtree partitioning policies¶
It is also possible to setup automatic static partitioning of subtrees via a set of policies. In CephFS, this automatic static partitioning is referred to as ephemeral pinning. Any directory (inode) which is ephemerally pinned will be automatically assigned to a particular rank according to a consistent hash of its inode number. The set of all ephemerally pinned directories should be uniformly distributed across all ranks.
Ephemerally pinned directories are so named because the pin may not persist once the directory inode is dropped from cache. However, an MDS failover does not affect the ephemeral nature of the pinned directory. The MDS records what subtrees are ephemerally pinned in its journal so MDS failovers do not drop this information.
A directory is either ephemerally pinned or not. Which rank it is pinned to is derived from its inode number and a consistent hash. This means that ephemerally pinned directories are somewhat evenly spread across the MDS cluster. The consistent hash also minimizes redistribution when the MDS cluster grows or shrinks. So, growing an MDS cluster may automatically increase your metadata throughput with no other administrative intervention.
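The redistribution property described above can be illustrated with a toy consistent-hash ring (invented here purely for illustration; the real MDS hash function differs): each rank owns several points on a ring, an inode number maps to the nearest point, and adding a rank only steals the keys nearest its new points rather than reshuffling everything.

```python
import hashlib
from bisect import bisect

def _h(key: str) -> int:
    # Stable 64-bit hash for ring positions and inode keys.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

def build_ring(num_ranks: int, vnodes: int = 64):
    """Each rank owns `vnodes` points on the hash ring."""
    return sorted((_h(f"rank{r}:{v}"), r)
                  for r in range(num_ranks) for v in range(vnodes))

def rank_for_inode(ring, inode: int) -> int:
    """Assign an inode to the rank owning the next point on the ring."""
    keys = [h for h, _ in ring]
    i = bisect(keys, _h(str(inode))) % len(ring)
    return ring[i][1]

# Growing from 3 to 4 ranks moves only the keys claimed by the new rank,
# roughly 1/4 of them, instead of reshuffling the whole population.
old, new = build_ring(3), build_ring(4)
inodes = range(10_000)
moved = sum(rank_for_inode(old, i) != rank_for_inode(new, i) for i in inodes)
print(moved / len(inodes))
```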
Presently, there are two types of ephemeral pinning:
Distributed Ephemeral Pins: This policy indicates that all of a
directory’s immediate children should be ephemerally pinned. The canonical
example would be the /home directory: we want every user’s home directory
to be spread across the entire MDS cluster. This can be set via:
setfattr -n ceph.dir.pin.distributed -v 1 /cephfs/home
Random Ephemeral Pins: This policy indicates any descendant sub-directory
may be ephemerally pinned. This is set through the extended attribute
ceph.dir.pin.random with the value set to the percentage of directories
that should be pinned. For example:
setfattr -n ceph.dir.pin.random -v 0.5 /cephfs/tmp
would cause any directory loaded into cache or created under /cephfs/tmp to be
ephemerally pinned 50 percent of the time.
It is recommended to only set this to small values, like .001 or 0.1%.
Having too many subtrees may degrade performance. For this reason, the config
mds_export_ephemeral_random_max enforces a cap on the maximum of this
percentage (default: .01). The MDS returns EINVAL when attempting to
set a value beyond this config.
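To see why only small values are recommended, it helps to estimate the expected number of ephemerally pinned subtrees for a given fraction (simple arithmetic, not a Ceph API; the function name is invented):

```python
def expected_pinned_subtrees(num_dirs: int, fraction: float) -> float:
    """Each directory under the tree is pinned independently with
    probability `fraction`, so the expected count is num_dirs * fraction."""
    return num_dirs * fraction

# With a million directories under the pinned tree:
print(expected_pinned_subtrees(1_000_000, 0.5))    # 500000.0 subtrees: far too many
print(expected_pinned_subtrees(1_000_000, 0.001))  # 1000.0 subtrees: manageable
```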
Both random and distributed ephemeral pin policies are off by default in
Octopus. The features may be enabled via the
mds_export_ephemeral_random and mds_export_ephemeral_distributed
configuration options.
Ephemeral pins may override parent export pins and vice versa. What determines which policy is followed is the rule of the closest parent: if a closer parent directory has a conflicting policy, use that one instead. For example:
mkdir -p foo/bar1/baz foo/bar2
setfattr -n ceph.dir.pin -v 0 foo
setfattr -n ceph.dir.pin.distributed -v 1 foo/bar1
The foo/bar1/baz directory will be ephemerally pinned because the
foo/bar1 policy overrides the export pin on foo. The foo/bar2
directory will obey the pin on foo normally.
For the reverse situation:
mkdir -p home/{patrick,john}
setfattr -n ceph.dir.pin.distributed -v 1 home
setfattr -n ceph.dir.pin -v 2 home/patrick
The home/patrick directory and its children will be pinned to rank 2
because its export pin overrides the policy on home.
If a directory has an export pin and an ephemeral pin policy, the export pin applies to the directory itself and the policy to its children. So:
mkdir -p home/{patrick,john}
setfattr -n ceph.dir.pin -v 0 home
setfattr -n ceph.dir.pin.distributed -v 1 home
The home directory inode (and all of its directory fragments) will always be
located on rank 0. All children including home/patrick and home/john
will be ephemerally pinned according to the distributed policy. This may only
matter for some obscure performance advantages. All the same, it’s mentioned
here so the override policy is clear.