Partitioning is necessary when the size of data exceeds the capacity of a single machine. By distributing the workload on each node, the performance is expected to boost by the coefficient manner. In this project, Modist, I explored the structure and functionalities of a distributed system with sharding/ partitioning capability, and received some inspiration from the unit test process.
Implementation
* Due to ethical consideration, this repo will remain private util the end of Spring 2023 semester
Modist is a client-server structured system that manage multiple functionalities and interactions between nodes with multiple server process. The client node communicate with routers to get information for future TCP connections and interact with partitioners to locate and replicate the key information. The cluster of all server side nodes is called Orchestrator, which configure each node and build up gRPC connections between them.
The connection between nodes are passed through protocol buffers. Protocol buffers are a data format to transfer binary representations of objects across the network, which lead to a higher throughput. Within each buffer, a message is passed to node. Protobuf messages are aggregates containing a set of typed fields:
- resolveableKV, message key with vector clock
- tas, required syntactic sugar used by Protobuf to order these fields in the binary format of the message
- Standard data types (bool, int, etc.) are available as well as keywords like optional and repeated
Clock and Conflicts
Modist take both physical and vector clock as its synchronize tool. Within each node, a resolver is implemented to handle the conflict when to incoming events holds different vector clocks. By definition, the highest clock value on each node will be kept as the latest time marker. The vector clock also handles new event by define empty value as zero, and assign them as concurrent. Modist using these mechanisms to acquire the happen-before relationship between events, and eventually keep a causal consistency.
Leaderless Replication
Leaderless replication is a safety guarantee when one node holding data is done. In Modist, a leaderless replication is applied when a request is made. The process of writing or reading from the replication is called a Quorum. While the client contact the Orchestrator, it use matrices R and W to indicate the number of replication that will participate the operation. Normally, for a given n node system, having W + R > n can guarantee that the replicated key can be read when needed.
To reach eventual consistency among nodes, read repair is required. By repair we mean to update the out-dated value on a node to the latest if detected. Therefore, the deterministic will be the vector clock. Coordinator detects stale values on other nodes and locally when it issues a read request, can then correct these values with the most up-to-date value.
Unit Test
Unit tests are conducted along with the development process of each feature. The goal is to cover the codes as much as possible so all features and operations can be tested without any bug. While composing the testing files, some special cases inspired me and my future unit testing and program designing.
Adoptability of nil value
Nil value can be tricky when it comes to adoptability. In some cases you just need to treat them as zero, while in other cases its an issue you need to take care of. When it comes to value comparison, recognize them as zero can be a feasible approach. You just need to add some mechanism that assign empty value to zero when read them, and it will be all set for comparison.
However, when it becomes to the collecting keys, the nil situation should be treat specially. For example, in Modist, when you trying to read a key from a node that didn’t receive a replication, you can make up a fake clock and return value to indicate the missing value. Fake clock can be acquired by the request clocks, therefore you won’t be worry about mistaken conflicts. Moreover, you need a ignorance mechanism when a nil is detected. Some data structures naturally handles these situations, or you can add additional statement to process only when it return message has a value in it.