Distributed Systems

Paper Notes

MapReduce

GFS

Operations

Supported operations: We support the usual operations to create, delete, open, close, read, and write files. Moreover, GFS has snapshot and record append operations.

master

What the master stores: The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks.

Metadata

metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas.

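File and chunk namespaces, and the mapping from files to chunks: persisted. Mutations are recorded in an operation log that is replicated to remote machines, so a master crash does not lose them.

Chunk location information: not persisted. The master requests it from chunkservers at startup and periodically thereafter, so losing it in a crash does no harm.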

Master memory bottleneck: If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price. The master keeps less than 64 bytes of metadata for each 64 MB chunk.
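To get a feel for why master memory is rarely the limit, here is a back-of-the-envelope estimate. It is only a sketch: it uses the GFS paper's upper bound of 64 bytes of metadata per 64 MB chunk, assumes every chunk is full, and ignores the file namespace entries. The function name is made up for illustration.

```python
CHUNK_SIZE = 64 * 2**20        # 64 MiB per chunk
METADATA_PER_CHUNK = 64        # bytes; upper bound quoted in the GFS paper


def chunk_metadata_bytes(total_data_bytes: int) -> int:
    """Upper-bound estimate of master memory used for chunk metadata."""
    # Round up: a partially filled last chunk still needs a metadata entry.
    num_chunks = -(-total_data_bytes // CHUNK_SIZE)
    return num_chunks * METADATA_PER_CHUNK


one_pib = 2**50
print(chunk_metadata_bytes(one_pib) / 2**30)  # 1.0 (GiB of chunk metadata for 1 PiB of data)
```

Many small files would push the number up, since each file needs at least one chunk entry plus a namespace entry.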

Chunk Locations

Chunk location information: obtained at startup, then kept up to date through HeartBeat messages.

Reasons:

1、This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on.

2、Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks.
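The two reasons above can be sketched as an in-memory table that the master rebuilds purely from chunkserver reports. This is a toy model, not the paper's implementation: the class name, the timeout parameter, and the explicit `now` argument (used to keep the example deterministic) are all assumptions.

```python
import time


class ChunkLocationTable:
    """In-memory only: rebuilt from chunkserver reports, never written to disk."""

    def __init__(self, timeout_s: float = 60.0):
        self.timeout_s = timeout_s      # hypothetical liveness timeout
        self.chunks_by_server = {}      # server id -> set of chunk handles
        self.last_seen = {}             # server id -> time of last heartbeat

    def heartbeat(self, server, chunk_handles, now=None):
        # The report replaces whatever the master believed before:
        # the chunkserver has the final word on what it stores.
        self.chunks_by_server[server] = set(chunk_handles)
        self.last_seen[server] = time.monotonic() if now is None else now

    def locations(self, chunk_handle, now=None):
        now = time.monotonic() if now is None else now
        return sorted(
            server
            for server, chunks in self.chunks_by_server.items()
            if chunk_handle in chunks
            and now - self.last_seen[server] <= self.timeout_s
        )


table = ChunkLocationTable(timeout_s=60)
table.heartbeat("cs1", ["c1", "c2"], now=0)
table.heartbeat("cs2", ["c1"], now=0)
table.heartbeat("cs1", ["c2"], now=10)   # cs1 no longer reports c1 (e.g. disk error)
print(table.locations("c1", now=10))     # ['cs2']
print(table.locations("c2", now=100))    # [] because cs1's last heartbeat is too old
```

Because nothing here is persisted, a restarted master simply starts with an empty table and waits for reports.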

Operation Log

Purpose: record metadata changes and define the order of concurrent operations. Not only is it the only persistent record of metadata, but it also serves as a logical time line that defines the order of concurrent operations.

checkpoints
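A minimal sketch of how an operation log and a checkpoint work together during recovery: load the latest checkpoint, then replay only the log records written after it. Everything here is a simplification and the names are hypothetical. Python lists stand in for the on-disk, remotely replicated log, and the record format is invented for the example.

```python
import json


class MetadataStore:
    def __init__(self):
        self.files = {}                 # file name -> list of chunk handles
        self.log = []                   # append-only operation log (serialized records)
        self.checkpoint = ("{}", 0)     # (serialized state, log position it covers)

    def _apply(self, op):
        if op["type"] == "create":
            self.files[op["file"]] = []
        elif op["type"] == "add_chunk":
            self.files[op["file"]].append(op["chunk"])
        elif op["type"] == "delete":
            self.files.pop(op["file"], None)

    def mutate(self, op):
        # Log first, then apply: the position in the log defines the order
        # of concurrent operations.
        self.log.append(json.dumps(op))
        self._apply(op)

    def take_checkpoint(self):
        self.checkpoint = (json.dumps(self.files), len(self.log))

    def recover(self):
        # Load the checkpoint, then replay only the records after it.
        state, position = self.checkpoint
        self.files = json.loads(state)
        for record in self.log[position:]:
            self._apply(json.loads(record))


store = MetadataStore()
store.mutate({"type": "create", "file": "/a"})
store.mutate({"type": "add_chunk", "file": "/a", "chunk": "c1"})
store.take_checkpoint()
store.mutate({"type": "add_chunk", "file": "/a", "chunk": "c2"})
store.files = {}        # simulate losing the in-memory state in a crash
store.recover()
print(store.files)      # {'/a': ['c1', 'c2']}
```

The checkpoint keeps recovery time short because only the tail of the log has to be replayed.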

cache

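Neither the client nor the chunkserver caches file data. (Clients do cache metadata.)

Reason (client): client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not caching also eliminates cache coherence issues.

Reason (chunkservers): chunks are stored as local files, so Linux's buffer cache already keeps frequently accessed data in memory.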

read/write : Clients never read and write file data through the master

ChunkSize : 64MB

Advantages:

1、reduces clients’ need to interact with the master

2、reduce network overhead by keeping a persistent TCP connection

3、reduces the size of the metadata stored on the master.

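Disadvantages:

1、hot spots: a small file consists of only a few chunks, perhaps just one, so the chunkservers storing them may become hot spots if many clients access the same file. In practice this has not been a major issue because our applications mostly read large multi-chunk files sequentially.

2、a batch-queue system: hot spots did develop when GFS was first used by a batch-queue system. An executable was written as a single-chunk file and then started on hundreds of machines at the same time, overloading the few chunkservers that stored it. Fix: We fixed this problem by storing such executables with a higher replication factor and by making the batch-queue system stagger application start times.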
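The paper does not say how the start times were staggered, so the following is only one hypothetical way to do it: give each job its own slot inside a time window and add random jitter within the slot. The function name and parameters are assumptions made for this sketch.

```python
import random


def staggered_start_times(num_jobs, window_s, seed=None):
    """Spread job start offsets (in seconds) across a window instead of starting all at t=0."""
    rng = random.Random(seed)
    slot = window_s / num_jobs
    # One job per slot, with random jitter inside the slot.
    return [i * slot + rng.uniform(0, slot) for i in range(num_jobs)]


offsets = staggered_start_times(num_jobs=5, window_s=60.0, seed=1)
print([round(t, 1) for t in offsets])
```

With hundreds of jobs spread over a minute, the chunkservers holding the executable see a steady trickle of reads instead of one burst.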

Consistency Model

Guarantees by GFS

data mutation

The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations. (A region can end up consistent, defined, or undefined.)

A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety. (Consistent means all clients see the same data no matter which replica they read from.)

After a sequence of successful mutations, the mutated file region is guaranteed to be defined and contain the data written by the last mutation.
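The three region states can be illustrated with a toy classifier. This is a deliberate simplification, not how GFS checks anything: it assumes every mutation covers exactly the same region, and the function name is made up for the example.

```python
def classify_region(replicas, mutations):
    """Classify a file region.

    replicas:  list of bytes, the region's content on each replica.
    mutations: list of bytes, the payloads clients tried to write there.
    """
    if len(set(replicas)) > 1:
        return "inconsistent"            # replicas disagree, e.g. after a failed mutation
    if replicas[0] in mutations:
        return "defined"                 # one mutation's data, in its entirety
    return "consistent but undefined"    # same everywhere, but mingled fragments


writes = [b"AAAA", b"BBBB"]
print(classify_region([b"BBBB", b"BBBB", b"BBBB"], writes))  # defined
print(classify_region([b"AABB", b"AABB", b"AABB"], writes))  # consistent but undefined
print(classify_region([b"AAAA", b"AABB", b"AAAA"], writes))  # inconsistent
```

The middle case is what concurrent successful writes can produce: every replica holds the same bytes, but they are a mix of fragments from several writers.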

Read flow diagram

[Image: image-20230506083337124]
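The read path described in the GFS paper can be sketched roughly as follows: the client turns a byte offset into a chunk index, asks the master for the chunk handle and replica locations (caching the answer), and then reads the data directly from a chunkserver. The classes below are hypothetical stand-ins, the read is assumed to stay inside one chunk, and the client simply picks the first replica rather than the closest one.

```python
CHUNK_SIZE = 64 * 2**20  # 64 MiB


class Master:
    """Holds metadata only; never serves file data."""

    def __init__(self, file_chunks, chunk_locations):
        self.file_chunks = file_chunks          # file name -> [chunk handle, ...]
        self.chunk_locations = chunk_locations  # chunk handle -> [chunkserver id, ...]
        self.lookups = 0                        # counts how often clients contact the master

    def lookup(self, filename, chunk_index):
        self.lookups += 1
        handle = self.file_chunks[filename][chunk_index]
        return handle, self.chunk_locations[handle]


class Client:
    def __init__(self, master, chunkservers, chunk_size=CHUNK_SIZE):
        self.master = master
        self.chunkservers = chunkservers        # server id -> {chunk handle: bytes}
        self.chunk_size = chunk_size
        self.cache = {}                         # (file, chunk index) -> (handle, locations)

    def read(self, filename, offset, length):
        # Step 1: translate the byte offset into a chunk index.
        index, start = divmod(offset, self.chunk_size)
        key = (filename, index)
        # Step 2: contact the master only on a cache miss.
        if key not in self.cache:
            self.cache[key] = self.master.lookup(filename, index)
        handle, locations = self.cache[key]
        # Step 3: read the bytes directly from a replica, not through the master.
        data = self.chunkservers[locations[0]][handle]
        return data[start:start + length]


# Tiny 4-byte chunks so the example fits on screen.
master = Master({"/f": ["h0", "h1"]}, {"h0": ["cs1"], "h1": ["cs2"]})
servers = {"cs1": {"h0": b"abcd"}, "cs2": {"h1": b"efgh"}}
client = Client(master, servers, chunk_size=4)
print(client.read("/f", 5, 2))   # b'fg'
print(client.read("/f", 4, 4))   # b'efgh'
print(master.lookups)            # 1: the second read hit the client's metadata cache
```

The lookup counter shows why a large chunk size reduces master load: repeated reads within the same chunk need only one metadata request.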

Snapshot

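GFS and DynamoDB provide snapshots mainly to improve data durability and disaster recovery, not merely to solve concurrent read/write problems.

GFS was designed for Google's large-scale, data-intensive workloads, with an emphasis on reliability, availability, and high aggregate bandwidth. Its snapshot feature therefore mainly serves to back up and restore data quickly and reliably after machine, network, or software failures. The paper itself describes users taking snapshots to quickly create branch copies of huge data sets, or to checkpoint the current state before experimenting with changes that can later be committed or rolled back.

DynamoDB is a highly scalable NoSQL database built for seamless scaling, high throughput, low latency, and durable storage. Its snapshots can be used for backup, data recovery, archiving, and versioning.

Snapshots can improve read/write efficiency, but that is not their core design goal: modern hardware and software can already handle large-scale concurrent reads and writes efficiently.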

Raft

Memcache at Facebook

MIT 6.824: Facebook's Memcache - Zhihu (zhihu.com)


Distributed Systems
http://example.com/2023/06/01/分布式系统-大数据/论文/论文笔记/
Author: where
Published: June 1, 2023