Skip to main content
  1. OSS/
  2. Ozone/

Ozone Introduction

·9 mins· ·
Blog En Ozone Oss
Table of Contents

Overview
#

1742375991588

1742376029768

Ratis
#

It was initially written for Ozone.

Ozone Manager (OM)
#

https://ozone.apache.org/docs/edge/concept/ozonemanager.html

Storage Container Manager (SCM)
#

https://ozone.apache.org/docs/edge/concept/storagecontainermanager.html

Containers
#

Containers are the fundamental replication unit of Ozone/HDDS, they are managed by the Storage Container Manager (SCM) service.

Containers are big binary units (5Gb by default) which can contain multiple blocks:

1742376685050

Blocks are local information and not managed by SCM. Therefore even if billions of small files are created in the system (which means billions of blocks are created), only of the status of the containers will be reported by the Datanodes and containers will be replicated.

When Ozone Manager requests a new Block allocation from the SCM, SCM will identify the suitable container and generate a block id which contains ContainerId + LocalId. Client will connect to the Datanode which stores the Container, and datanode can manage the separated block based on the LocalId.

Open vs. Closed containers
#

When a container is created it starts in an OPEN state. When it’s full (~5GB data is written), container will be closed and becomes a CLOSED container.

The fundamental differences between OPEN and CLOSED containers:

OPENCLOSED
mutableimmutable
replicated with RAFT (Ratis)Replicated with async container copy
Raft leader is used to READ / WRITEAll the nodes can be used to READ

DataNodes
#

When a client wants to read a key from Ozone, the client sends the name of the key to the Ozone Manager. Ozone manager returns the list of Ozone blocks that make up that key.

An Ozone block contains the container ID and a local ID. The figure below shows the logical layout of the Ozone block.

1742376694749

The container ID lets the clients discover the location of the container. The authoritative information about where a container is located is with the Storage Container Manager (SCM). In most cases, the container location will be cached by Ozone Manager and will be returned along with the Ozone blocks.

Once the client is able to locate the container, that is, understand which data nodes contain this container, the client will connect to the datanode and read the data stream specified by Container ID:Local ID. In other words, the local ID serves as index into the container which describes what data stream we want to read from.

https://ozone.apache.org/docs/edge/concept/datanodes.html

Components
#

Ozone Manager (OM)
#

Metadata store in rocksdb

Double Buffer
#

1742376072187

1742376077700

  • preExecute() executed at RPC context level. Generate a OMRequest object with all the required information, which will be passed through RatisClient by calling sendAsync(). All the object information that will eventually be persisted to RocksDB and returned to the user must be generated in this step.
  • `validateAndCreateResponse() `` Executed in stateMachine applyTransaction. This should be an idempotent operation. This is a very light weight operation which will be executed on leader and follower OM. In general the following steps will be done in this method.
    1. Acquire lock.
    2. Validate the request can be executed E.g. check that the target volume and bucket exist, perform authorization checks.
    3. Create a response object for the request to be added to double buffer.
    4. Add the response information to TableCache. This makes the result of the operation visible to subsequent requests.
    5. Release lock.

Leader side Execution (to replace Double Buffer)
#

1742376093790

[!Note] volume/bucket level lock -> key prefix lock

Storage Container Manager (SCM)
#

Architecture
#

What does SCM manage?

1742376120850

How does it manage?

1742376131632

Metadata
#

SCM RocksDB Tables
#

Pipeline Table

ozone debug ldb --db=/data/metadata/scm.db value-schema --cf=pipelines | jq '.Pipeline |  keys'

Container Table

ozone debug ldb --db=/data/metadata/scm.db value-schema --cf=containers | jq '.ContainerInfo |  keys'

Deleted Blocks Table

ozone debug ldb --db=/data/metadata/scm.db value-schema --cf=deletedBlocks | jq '.DeletedBlocksTransaction |  keys'

Services
#

1742376177353

Node Lifecycle
#
1742376189659
1742376198093

[!Note] This setup has the following configuration changes

ozone.scm.stale.node.interval is 1m
ozone.scm.dead.node.interval is 2m

You can check the configs in http://127.0.0.1:9876/#!/config

  • What is HEALTHY_READONLY state?

    • Used during upgrade
    • The datanodes are moved to this state when SCM is finalized but the datanodes are yet to be finalized.
  • What happens when a datanode is marked as stale?

  • What happens when a datanode is marked as dead?

Operational State?
#
  • IN_SERVICE
  • ENTERING_MAINTENANCE
  • IN_MAINTENANCE
  • DECOMMISSIONING
  • DECOMMISSIONED

Operational State along with Health State will give the overall state of the datanode.

Pipeline Lifecycle
#

1742376210653

Types of Pipelines

  • Ratis
  • EC
  • Standalone

Standalone Pipeline is used for reads (on Ratis Containers).

Pipelines creation is a costly operation, so they are created in background and are reused.

Pipeline will be closed only if

  • there is a write failure on the Pipeline.
  • there is a slow follower in the Pipeline Ratis group.
  • one of the datanode in the pipeline is marked as stale.
  • the pipeline close command is executed by admin.
Pipeline Creation Flow
#
  • Based on the Rack Awareness logic, the datanodes are picked for the Pipeline creation
  • The Pipeline is created on SCM and it’s moved to ALLOCATED state
  • The SCM will send the Pipeline creation request to the datanodes that are part of the pipeline
  • The datanodes will create the pipeline and send the response (Pipeline report) back to SCM via heartbeat
  • The SCM will move the Pipeline to OPEN state once it received the Pipeline report
  • Now the Pipeline can be used for Container allocation.
  • The Pipelines which are in ALLOCATED state for a long time will be deleted by the scrubber.

Following are the ways in which Pipelines can be created

  • Background Pipeline Creator
  • We will also created a Pipeline if there is a write request and we don’t have any open Pipelines
  • Using create Pipeline command
ozone admin pipeline create
Pipeline Scrubber
#

Pipelines in allocated state for long time will be removed by scrubber The timeout can be configued using ozone.scm.pipeline.allocated.timeout, the default value is 5 minutes.

Pipeline Close Flow
#
  • The Pipeline is moved to CLOSED state
  • We give some time for containers to get gracefully close before we delete the Pipeline.
    • This is done to avoid moving the containers to QUASI_CLOSED state.
  • After ozone.scm.pipeline.destroy.timeout the Pipeline is deleted from SCM and Delete Pipeline Comands are sent to Datanodes.

Pipeline Action from Datanode Log to grep for in Datanode

LOG.error("pipeline Action {} on pipeline {}.Reason : {}",
           action.getAction(), pipelineID,
           action.getClosePipeline().getDetailedReason());
What is DORMANT state?
#
  • This state is introduced for troubleshooting and debugging purposes.
  • If you close a Pipeline, the datanodes that were part of the Pipeline becomes available and the Background Pipeline Creator will try to create a new Pipeline.
  • In situations where you want to restrict your container writes to a specific set of datanodes (Pipelines), you can deactivate the other Pipelines in the cluster.
  • Deactivating a Pipeline will move the Pipeline to DORMANT state.
  • Pipelines in DORMANT state will not be closed or deleted, at the same time they are not used for writes.
  • Once you’re done with your troubleshooting, you can activate the Pipeline and it will be moved to OPEN state.
Container Lifecycle
#
Container State Transition
#

1742376220605

Container State Flow
#

1742376227414

Container creation logic Clients create the container. The containers are directly moved to OPEN state in SCM upon creation. There is no allocated state for containers.

The number of OPEN containers in a pipeline is controlled by ozone.scm.pipeline.owner.container.count.

QUASI_CLOSED state?
#
  • When the container is not closed via Ratis, one of the replica could be lagging behind.
How are QUASI_CLOSED containers closed?
#

Replication Manager will try to retain at least one replica per unique Origin Node ID.

Block Management
#
Block Allocation
#
  • On Open Containers
  • Sequence ID generator
  • Only done on leader
Block Deletion
#
  • Sent only for closed containers
  • The datanode marks the block for deletion and sends the ack
  • Deletion ack should be received from all the replicas (three datanodes)

Note: Recent improvement is to send block delete command only if the contianer has three healthy replicas.

There is a delete Block retry count. If the retry count exceeds the configured value, SCM will not retry the block deletion.

Replication Manager
#

By default, it runs every 5 minutes, checking which Containers are under-replicated or over-replicated based on node operation state & health state, and then performs corresponding container replica copy/delete operations accordingly.

The current drawback is that when node state changes occur, it takes 5 minutes before they can be detected. PR: HDDS-8660. Notify ReplicationManager when nodes go dead or out of service may solve this problem.

SafeMode Behaviour
#

OM Operations
#
Client OperationExpected Behaviour
Volume CreateShould Succeed.
Bucket CreateShould Succeed.
Volume ListShould Succeed.
Bucket ListShould Succeed.
Key/File ListShould Succeed.
Key/File WriteShould fail with Safemode Exception
Key/File ReadIt should succeed if at least one DN holding the replica is registered.
SCM Admin Operations
#
Admin CommandExpected Behaviour
Close ContainerShould Succeed. The container will be moved to Closing state
CreatePipelineShould fail with Safemode Exception
GetPipelineShould Succeed.
ozone admin safemode status --verbose

Check SCM Web UI for details.

Problem
#
  • Try to put cluster into safemode and verify the client commands.

Disk Layout
#

1742376242216

The version file

cat /data/metadata/scm/current/VERSION

Parse Raft Log file

ozone debug ratislogparser scm -s=/<Ratis Dir>/<Raft Group ID>/current/log_inprogress_1

High Availability
#

1742376253250

Datanode Heartbeat Protocol
#

/**
* This message is send by data node to indicate that it is alive or it is
* registering with the node manager.
*/
message SCMHeartbeatRequestProto {
  required DatanodeDetailsProto datanodeDetails = 1;

  optional NodeReportProto nodeReport = 2;
  optional ContainerReportsProto containerReport = 3;
  repeated IncrementalContainerReportProto incrementalContainerReport = 4;
  repeated CommandStatusReportsProto commandStatusReports = 5;
  optional ContainerActionsProto containerActions = 6;
  optional PipelineActionsProto pipelineActions = 7;
  optional PipelineReportsProto pipelineReports = 8;
  optional LayoutVersionProto dataNodeLayoutVersion = 9;
  optional CommandQueueReportProto commandQueueReport = 10;
}

Some Features
#

Streaming write with Pipeline
#

For the datanode-side, the ContainerStateMachine in Ozone already has implemented all the methods for the Ratis Async API. For Ratis Streaming, ContainerStateMachine should also implement the stream(..) method and the link(..) method in Ratis StateMachine.DataApi.

The stream(..) method is to create a StateMachine.DataStream for receiving incoming stream data from the client. Ozone should implement StateMachine.DataChannel, which extends WritableByteChannel, for writing incoming stream data to its local storage.

The link(..) method is to link the incoming stream with a Ratis log entry. Since the ContainerStateMachine already can handle WriteChunk requests for the Async API, the link(..) should use similar code to implement.

Note that the Ratis RaftServer handles the incoming network traffic. No change is required for receiving the Ratis Streaming requests.

Routing Table
#

topology aware

1742376263088

Ozone Realtime Cross-Region Bucket Replication
#

Related

Building a high-performance, scalable server for AD management
·13 mins
Blog En
peterxcli/ad-server Scalable state machine, exemplified through an advertisement management system | Dcard 2024 internship assignment | Distributed Systems, ensuring high availability and fault tolerance Go 5 1 Benchmark Result # Local: QPS: 96000/s K6 Load Test The primary bottleneck is the gin router. With an optimized router engine, the QPS could potentially reach 800000/s. - See gin router performance benchmarks. Short Description # A infinitely scalable. advertisement management system, baked with replicated advertisement business state machine, replicated log system, and fault recovery mechanism.
Bikefest 2024 Backend
·4 mins
Blog En
peterxcli/bike-festival-2024-backend Backbone for the 2024 NCKU Bike Festival, featuring event notifications and Line Login for authentication. Built with Go, Docker, and Redis. Go 1 0 DB Schema # User Table # Field Type GORM Options JSON Key Description ID string type:varchar(36);primary_key id The unique identifier for the user. (from line account) Name string type:varchar(255);index name The name of the user. (from line account) User-Event Table # Field Type Description user_id varchar(36) The ID of the user, linking to User.