What is CubeFS?
CubeFS is a cloud-native storage product, being one of the "Incubator" grade projects of the CNCF ( Cloud Native Computing Foundation ). It supports several data access protocols, such as S3, POSIX, and HDFS, and supports two storage engines : multi-replica and erasure code . It provides users with multiple functions such as multi-tenancy, multi-AZ deployment, and cross-region replication, and is widely used in scenarios such as big data, AI, container platforms, databases, middleware storage and computing separation, data sharing and data protection.
Supports various access protocols such as S3, POSIX and HDFS, and access between protocols is interoperable.
- POSIX Compatible : Compatible with the POSIX interface, making application development extremely simple for upper-layer applications, as convenient as using a local file system. Additionally, CubeFS relaxed the consistency requirements of POSIX semantics during implementation to balance the performance of file and metadata operations.
- S3 Compatible – Supporting the AWS S3 object storage protocol, users can use the native Amazon S3 SDK to manage resources in CubeFS.
- Compatible with HDFS : Compatible with the Hadoop FileSystem interface protocol, users can use CubeFS to replace the Hadoop file system (HDFS) without affecting the upper layer business.
By supporting two engines: multi-replica and erasure coding, users can flexibly choose according to their business scenarios.
- Multi-replica storage engine : Data between copies is in a mirror relationship, and data consistency between copies is ensured through a highly consistent replication protocol. Users can flexibly configure different numbers of copies according to their application scenarios.
- Erasure Coding Storage Engine ：The erasure coding engine has the characteristics of high reliability, high availability, low cost, and supports ultra-large scale (EB). According to different AZ models, erasure coding modes can be flexibly selected.
Support multi-tenant management and provide detailed tenant isolation policies.
You can easily build distributed storage services with PB or EB level scale, and each module can be scaled horizontally.
CubeFS supports multi-level caching to optimize access to small files and supports multiple high-performance replication protocols.
- Metadata Management - The metadata cluster uses in-memory metadata storage and uses two B-trees (inodeBTree and dentryBTree) to manage indexes and improve metadata access performance.
- Strong consistency replication protocol ：CubeFS adopts different replication protocols according to the file writing mode to ensure data consistency between replicas. If the file is written sequentially, the primary-backup replication protocol is used to optimize I/O performance. If the file is randomly written to overwrite the contents of the existing file, a Multi-Raft based replication protocol is used to ensure high data consistency.
- Multi-level caching – Erasure coding volume supports multi-level caching acceleration capability to provide higher data access performance for hot data:
- Local Cache: The BlockCache component can be deployed on the client machine as a local cache using the local disk. It can directly read the local cache without going through the network, but the capacity is limited by the local disk.
- Global Cache: A distributed global cache created with the DataNode replication component. For example, a DataNode with an SSD deployed in the same data center as the client can be used as a global cache. Compared with the local cache, it needs to go through the network, but it has a larger capacity and can be dynamically scaled, and the number of replicas can be adjusted.
Based on the CSI (Container Storage Interface ) standard, CubeFS is easily integrated and deployed in Kubernetes.
Use cases and scenarios
As a cloud-native distributed storage platform, CubeFS provides multiple access protocols, so you can use it in different situations, let's list:
Big data analysis
Compatible with the HDFS protocol, CubeFS provides a unified storage foundation for the Hadoop ecosystem (such as Spark and Hive), providing unlimited storage space and high-bandwidth data storage capabilities for computing engines.
Deep Learning/Machine Learning
As a distributed parallel file system, CubeFS supports AI training, model storage and distribution, I/O acceleration, and other requirements.
Shared storage between containers
The container cluster can store the configuration files or initialization load data of container images in CubeFS, and read them in real time when batch loading containers. Multiple PODs can share persistent data through CubeFS, and rapid failover can be performed in case of POD failure.
Databases and middleware
Provides high-concurrency, low-latency cloud disk services for database applications such as MySQL, ElasticSearch, and ClickHouse, achieving complete separation between storage and compute.
Provides high-reliability, low-cost object storage services for online businesses (such as advertising, clickstreams, and searches) or end-user graphic, text, audio, and video content.
From NAS (Network Attached Storage) to the cloud
It replaces traditional local storage and offline NAS and facilitates cloud adoption strategies.