The Case for Data-Centric Hyperconvergence

In this blog post I would like to introduce the notion of data-centric hyperconvergence. Why should someone care about it? In two words: cost reduction. In a few more words: over the last decade, the ability to store data grew 10 times more than the ability to move data, and 20 times more if we consider only the last 5 years (details below). Thus, moving stored data (even within the data center) for the purpose of processing it becomes too costly, if not impossible.

The storage capacity growth mentioned above is driven by the need to store more data, and the type of data with the largest growth is unstructured data. Unstructured data refers to a blob of bits whose structure is unknown to the storage system where it is kept; think images, videos or audio files. The other types of data are structured and semi-structured data. Structured data is ‘strongly typed’ data having a descriptive schema; this is typically what we keep in relational databases. Semi-structured data is essentially text having some loose structure (e.g. log files or time series coming from IoT sensors). It is important to note that structured data is estimated at 20% of the overall data being stored, while the remaining 80% is semi-structured and unstructured data.

What is Data-Centric Hyperconvergence?

‘Traditional’ hyperconvergence usually refers to a horizontally scalable infrastructure that consists of storage, compute and networking with integrated management. Data-centric hyperconvergence is no different in this respect, only it addresses a different problem: the need to move massive amounts of data.

Storage and Compute

Let's start with the compute part. A data-centric computation is a computation that is typically co-located with a designated piece of data on the same compute node. In some cases, which are described below, the compute will be one hop away from the data. We will get back to the compute part, but at this point I would like to switch to the storage part.

The best storage candidate for data-centric hyperconvergence is an object store (e.g. Amazon S3), mainly due to the massive scale, easy accessibility and ease of management of object stores. Here are a few of their characteristics:

  1. Architected to store and serve 100s of PB of data.
  2. Accessible from anywhere via HTTP REST API.
  3. Typically have built-in geo-replication, erasure coding and balanced resource utilization.
  4. Implement a simple data model: the atomic unit of data that can be stored is an object, a blob of data that comes together with metadata. Objects are grouped in non-hierarchical containers (or buckets) with no hard limit on the number of objects in a container.
  5. Have a rich metadata model: metadata is a way to describe the object and ranges from system metadata (e.g. object size or date of creation) to user-defined metadata (e.g. this object is a sci-fi movie or has an average loudness of 60 dB). As I point out below, the metadata part is crucial to data-centric hyperconvergence (see the sketch after this list).
  6. Multi-tenant: object stores are typically multi-tenant, with each tenant having an object store account. An account is essentially a set of containers. Multi-tenancy makes object stores a good fit for the storage-as-a-service model.
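
To make the object and metadata model concrete, here is a minimal sketch using python-swiftclient; the auth endpoint, credentials, container and object names are hypothetical placeholders:

    from swiftclient.client import Connection

    # Connect to the object store (endpoint and credentials are placeholders).
    conn = Connection(authurl='http://swift.example.com/auth/v1.0',
                      user='account:user', key='secret')

    # Store a video as an object, attaching user-defined metadata to it.
    with open('trailer.mp4', 'rb') as f:
        conn.put_object('movies', 'trailer.mp4', contents=f,
                        content_type='video/mp4',
                        headers={'X-Object-Meta-Genre': 'sci-fi',
                                 'X-Object-Meta-Avg-Loudness': '60dB'})

    # Reading the object back returns the metadata together with the blob.
    headers, body = conn.get_object('movies', 'trailer.mp4')
    print(headers['x-object-meta-genre'])  # 'sci-fi'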

Since there are no constraints on the actual type or format of the data that can be placed in an object store, there is also a wide range of computations that one might want to co-locate with the data, each with its own hardware needs. Thus, an object store that supports storage pools is an especially good fit for data-centric hyperconvergence. By storage pools I refer to the ability of the object store to use different types of hardware to store different data while:

  • Maintaining a unified namespace for all pools, and
  • Allowing each pool to scale independently of the others.

OpenStack Swift’s storage policies are an example of such pools: different containers of the same account can be assigned to different storage pools, each of which is stored on a different type of hardware and potentially with a different redundancy scheme. Having such pools allows assigning different compute-to-storage ratios, as well as different processor types, to different data. For example, one can build a pool for image processing / deep learning that is rich with GPUs, or pools for high- and low-intensity processing having many or few cores per TB of data, and so on. Storage pools essentially facilitate a rich set of data services that can be offered by a cloud provider.
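
To illustrate, assuming a Swift deployment that defines a storage policy named, say, 'gpu-rich' (the policy and container names are made up), a container is pinned to that pool at creation time via the X-Storage-Policy header:

    from swiftclient.client import Connection

    conn = Connection(authurl='http://swift.example.com/auth/v1.0',
                      user='account:user', key='secret')

    # Create a container bound to the 'gpu-rich' storage policy; every object
    # placed in it will land on the GPU-equipped pool.
    conn.put_container('training-images',
                       headers={'X-Storage-Policy': 'gpu-rich'})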

Converging compute and storage means running computations inside the object store. This calls for some sandboxing technology, since:

  1. Resource isolation is crucial for the storage system to continue functioning, and
  2. Security-wise, we must make sure that a computation accesses only the data it is meant to process.

The sandboxing technology can be virtualization, containerization, or even just a JVM.
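
As a rough sketch of what container-based sandboxing could look like, the following uses the Docker SDK for Python to run a computation with capped resources and no network access; the image name, command and mounted path are hypothetical:

    import docker

    client = docker.from_env()

    # Run the computation in a throwaway container: memory and CPU are capped
    # so the storage node keeps functioning, and networking is disabled so the
    # code can only touch the object data it was handed.
    client.containers.run(
        image='tenant-image-processing:latest',  # per-account software stack
        command='python /opt/run_computation.py',
        mem_limit='512m',                         # resource isolation
        nano_cpus=1_000_000_000,                  # at most one CPU
        network_mode='none',                      # no network access
        volumes={'/srv/objects/obj123': {'bind': '/data/input', 'mode': 'ro'}},
        remove=True)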

Network

Object stores need the network for several purposes. The most fundamental usage is to store and retrieve objects. This includes replicating or erasure-coding an object across the cluster, which in some cases can span geographical regions. Another usage is object reconstruction, which takes place when there is a data corruption whose fixing requires moving data over the network. Some object store architectures use background processes to constantly make sure data is evenly spread across the cluster (or across a storage pool). These processes are significant bandwidth consumers, especially when making changes to the cluster layout, e.g. by adding more capacity. Much of object store management is devoted to throttling this bandwidth consumption in various ways. Data-centric hyperconvergence introduces another potential usage for the cluster-internal network: one such case is when the data is erasure coded, and the whole object needs to be assembled before it can be processed. The point to make here is that networking is also in the mix, and needs to be taken into account in a unified management solution.

Data Management

Finally, data-centric hyperconvergence calls for a data management ingredient. By data management I refer to index and search capabilities over object metadata. The combination of compute and search by metadata has great potential. Think about a computation that enriches the metadata of an object, followed by a search for all objects that have a certain feature. For example, following a face recognition computation done during ingest, one can query for all objects having the face of a well-known figure. This OpenStack talk elaborates on the idea of data management.
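
As a rough sketch, assuming the object metadata has been indexed into Elasticsearch (which is what swift-metadata-sync does), such a query could look as follows; the index name and metadata field are hypothetical:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(['http://es.example.com:9200'])

    # Find all objects whose metadata was enriched during ingest with a
    # 'faces' tag containing a given well-known figure.
    result = es.search(index='swift-metadata', body={
        'query': {'match': {'x-object-meta-faces': 'albert einstein'}}})

    for hit in result['hits']['hits']:
        print(hit['_source'])  # container/object name plus its metadata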

Data-Centric Hyperconvergence vs. ‘Traditional’ Hyperconvergence

For readers familiar with HCI (hyperconverged infrastructure) and its major use cases, it is quite obvious that data-centric hyperconvergence and HCI are different beasts. Still, I would like to spell out some differences, suggesting that unlike data-centric hyperconvergence, HCI is essentially compute-centric.

  • In many cases HCI goes hand-in-hand with high-end all-flash storage arrays. The reason is the need to accommodate general-purpose workloads, which include real-time analytics and low-latency transactional workloads. This implies both higher storage costs and smaller storage capacity when compared to object stores.
  • Much of the HCI development efforts go to automated placement decisions of VMs, where to migrate if necessary, seamless migration mechanisms, etc. In the data-centric realm, the compute is simply placed where the data is.
  • HCI involves computations that are typically long-lived (think VDI) and have a complex state that needs to be backed up and restored.

Being different beasts, HCI and data-centric hyperconvergence can be considered complementary technologies, especially in the realm of big data. Data-centric hyperconvergence can be used to enrich, clean, filter and format unstructured data in a way that makes it consumable by more traditional big data machinery such as Spark / Hadoop. Some concrete examples are given below.

Storage vs. Network Growth Rate

This section gives the details behind the storage vs. network growth rate ratios quoted above. Take a look at the following graph.

[Graph: growth factors of storage and network capacities over the last decade]

The graph shows the growth factor of storage and network capacities in the last decade. Let's start with Ethernet.

  • In 2010 Ethernet was at 10Gbps. In 2014 there were already 100Gbps solutions out there, and recently we have started seeing 200Gbps solutions. Overall, this comes to a 20X growth factor.
  • InfiniBand, on the other hand, was already at 56Gbps in 2011, and so, although it is now at 200Gbps, its growth rate is smaller.
  • HDDs did not grow as fast as Ethernet, and their 8X factor comes from the anticipated 16TB helium-filled hard drives (in fact, Seagate announced they plan to have a 20TB helium-filled HDD by 2020, but this may be an SMR drive, which I prefer not to consider for data-centric hyperconvergence).
  • Now, the real story comes from SSDs. Samsung rolled out a 15TB SSD in 2016. At the end of 2016 Seagate announced a 60TB SSD, and Toshiba announced a 100TB SSD in 2017. This brings us to a 50X growth factor in storage capacity vs. a 20X growth factor in networking. Note that considering only the last 5 years (as of 2015), we are looking at 10X growth in storage vs. 2X growth in networking.

The point is that storage and networking “capacities” cannot be naively compared. One reason is that the number of disks and the number of network ports in a typical server are not the same. Considering a server with 16 disks and 4 network ports (a somewhat conservative ratio in favor of the network), we see that such servers grew 800X in storage capacity but only 80X in bandwidth (or 160X vs. 8X looking at the last 5 years only). Another point is that in some cases it is easier to upgrade to a bigger disk than to a faster network.
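
To make the arithmetic explicit, here is a small sketch that reproduces the per-server ratios above from the per-device growth factors and the assumed 16-disk / 4-port server:

    # Per-device growth factors quoted above (last decade / last 5 years).
    storage_growth = {'decade': 50, 'five_years': 10}  # SSD capacity
    network_growth = {'decade': 20, 'five_years': 2}   # Ethernet bandwidth

    # Assumed server layout: 16 disks vs. 4 network ports.
    disks, ports = 16, 4

    for period in ('decade', 'five_years'):
        server_storage = disks * storage_growth[period]  # e.g. 16 * 50 = 800
        server_network = ports * network_growth[period]  # e.g. 4 * 20 = 80
        print('%s: %dX storage vs. %dX network (ratio %d:1)' % (
            period, server_storage, server_network,
            server_storage // server_network))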

Data-Centric Hyperconvergence Solutions

There are not many data-centric hyperconvergence solutions that I am aware of, and none of them is a fully-fledged solution. The ones I am aware of come from OpenIO, Triton Object Storage and zeroCloud.

In this blog I do not intend to compare these solutions, but rather to advocate for OpenStack Swift and Storlets, which, together with swift-metadata-sync from SwiftStack, provide a data-centric hyperconvergence solution.

Storlets are implemented as a Swift ‘plugin’ that allows running computations inside Swift storage nodes. One nice feature of storlets is their serverlessness: the end user can upload a computation as if it were just a data object. Once uploaded, the user can invoke it over the data without any further server-side configuration or maintenance, as sketched below.
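
For illustration, assuming a storlet has already been uploaded to the account (the storlet and object names below are placeholders), invoking it boils down to adding the X-Run-Storlet header to an ordinary GET:

    from swiftclient.client import Connection

    conn = Connection(authurl='http://swift.example.com/auth/v1.0',
                      user='account:user', key='secret')

    # A regular GET, except that the X-Run-Storlet header tells Swift to pipe
    # the object through the named storlet on the storage node before the
    # data is returned to the client.
    headers, body = conn.get_object('movies', 'trailer.mp4',
                                    headers={'X-Run-Storlet': 'tagstorlet.py'})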

Storlets use Docker to sandbox computations, and each Swift account can have its own Docker image containing whatever software stack is required for that account. Thus, if one account cares about image processing, it could have OpenCV in its Docker image; if another cares about machine learning, it might want Python scikit-learn installed.

The storlets themselves can be written in Java or Python, and it is quite easy to add more ‘language bindings’.
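
Here is a minimal sketch of a Python storlet, following the interface described in the storlets documentation; it streams the input object to the output while enriching its metadata (the tag it adds is made up):

    class TagStorlet(object):
        def __init__(self, logger):
            self.logger = logger

        def __call__(self, in_files, out_files, params):
            # Enrich the object's metadata, then pass the data through as-is.
            metadata = in_files[0].get_metadata()
            metadata['processed-by'] = 'TagStorlet'
            out_files[0].set_metadata(metadata)
            out_files[0].write(in_files[0].read())
            in_files[0].close()
            out_files[0].close()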

Two recently presented use cases of storlets are given below, and the storlets documentation can be found here.

  1. Spark and storlets shows how to boost Spark SQL queries by offloading filtering work to Swift and Storlets.
  2. Machine learning with Storlets shows data preparation and data tagging using computer vision and machine learning techniques. Also featured is a nice integration of storlets with Jupyter notebooks.

 
