The CPU Network Tradeoff

Should data be brought to the compute or vice versa?

This is an old question in the 'storage centric' workload realm. Old as it is, I am not aware of any cost model that tries to answer it in terms of price, that is: what costs more? Bringing the data to the compute node, or placing more CPU power where the data resides? Be it my ignorance or the actual non-existence of such a model, here is a simple pricing model that attempts to answer that question.

As one might expect, this is going to be workload dependent; that is, given a certain workload, would it be cheaper to bring the data to the server running the workload, or to place more CPU power where the data is stored. The model makes use of a so-called “CPU per GB” metric of a workload W (GCW in short). GCW is the time in seconds it takes the workload W to process 1GB worth of data when running on a single 100% utilized core. Note that GCW is measured in [Sec/GB]. To determine the actual GCW of a workload W, one should probably take an average over a mixture of representative input data of W.

Keeping GCW in mind, let's define some basic units that will help us compare prices:

  • Core/Sec price. This unit tells us what the price of a CPU second is. A simple way to get a number here is to take the price of a given CPU and divide it by the number of cores it has (counting hardware threads) and by the number of seconds in 3 years. The actual number of years is not very important. To make the model more accurate we may want to throw in the power usage, but let's leave that for now. The unit is measured in [$/Sec]. We will use CS to denote this unit.
  • Pipe/Sec price. This unit tells us how much it costs per second to move bits from one machine to another. Moving data from one machine to another requires at least two host ports and two switch ports. To get the price of a switch port per second, we take the price of a switch and divide it by its number of ports and by the number of seconds in 3 years. Similarly, we can get the price per second of a host port. The Pipe/Sec price is then the combined per-second price of the ports a single transfer occupies. It too is measured in [$/Sec]. We will use PS to denote this unit.
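To make these unit definitions concrete, here is a minimal Python sketch. The function names and the 3-year amortization constant are mine, and summing two switch ports plus two host ports into PS is my reading of the definition above:

    SECONDS_IN_3_YEARS = 3 * 365 * 24 * 3600  # the amortization period used by the model

    def core_sec_price(cpu_price_usd, hw_threads):
        """CS: price of one CPU second, amortized over 3 years [$/Sec]."""
        return cpu_price_usd / (hw_threads * SECONDS_IN_3_YEARS)

    def pipe_sec_price(switch_price_usd, switch_ports, nic_price_usd, nic_ports):
        """PS: per-second price of the ports one transfer occupies
        (two switch ports plus two host ports), amortized over 3 years [$/Sec]."""
        switch_port = switch_price_usd / (switch_ports * SECONDS_IN_3_YEARS)
        host_port = nic_price_usd / (nic_ports * SECONDS_IN_3_YEARS)
        return 2 * switch_port + 2 * host_port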

One more unit that we will need is the network coefficient, or NC in short. This coefficient represents the network speed normalized to GB (Gigabytes). In other words, given a data size in GB, we want to know how much time in seconds it takes to move it. Note that the units of NC are [Sec/GB]. For 10Gbps Ethernet, NC is 8/10: we multiply by 8 because the network speed is given in bits while we measure data in Bytes, and we divide by 10 because the link speed is 10Gbps. In other words, it takes 0.8 seconds to move 1GB worth of data over a 10Gbps link.
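In code form this is a one-liner (the function name is mine):

    def network_coefficient(link_gbps):
        """NC: seconds needed to move 1GB over a link of the given speed [Sec/GB]."""
        return 8.0 / link_gbps

    network_coefficient(10)  # 0.8 -> 1GB takes 0.8 seconds over 10Gbps Ethernet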

Given all these units, we can now spell out the answer we are looking for. Given a workload and an input size:

  • The CPU price to run the workload W over the data is: GCW [Sec/GB] X size [GB] X CS [$/Sec]
  • The network price to move the data to the workload is: NC [Sec/GB] X size [GB] X PS [$/Sec]
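As a sketch in code (function names are mine), the comparison boils down to the following. Note that the data size multiplies both prices, so which side is cheaper does not actually depend on the input size:

    def cpu_price(gcw, size_gb, cs):
        """GCW [Sec/GB] X size [GB] X CS [$/Sec]."""
        return gcw * size_gb * cs

    def network_price(nc, size_gb, ps):
        """NC [Sec/GB] X size [GB] X PS [$/Sec]."""
        return nc * size_gb * ps

    def cpu_near_data_is_cheaper(gcw, cs, nc, ps):
        # size cancels out: compare GCW X CS against NC X PS directly
        return gcw * cs < nc * ps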

To get a better feeling for the model, I am adding some examples. The actual numbers should be taken with a grain of salt for various reasons listed below; the idea is really to get a feel for the model. The workload in question is a simple grep. More precisely, we run ‘time grep FRA data.csv > /tmp/example’, where data.csv is a 130MB CSV file. 130MB is ~0.127GB. Thus, taking the (real) time we got from the above command and dividing it by 0.127, we get the GCW values below. See this spreadsheet for all the arithmetic.
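Deriving GCW from such a timed run is a one-line division; for instance, the Intel Core i5 figure below implies a 'real' time of about 0.59 seconds for the 0.127GB file:

    def gcw_from_run(real_seconds, size_gb):
        """GCW of a single run: wall-clock seconds per GB of input [Sec/GB]."""
        return real_seconds / size_gb

    gcw_from_run(0.585, 0.127)  # ~4.61 [Sec/GB], the Intel Core i5 figure below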

  • Intel Core i5 2.8 GHz – 4.61 [Sec/GB]
  • Intel E5-2680 v3 @ 2.50GHz – 1.69 [Sec/GB]
  • Broadcom ARM11 processor – 3 [Sec/GB]

Using Internet prices for the CPUs, and taking into account the number of hardware threads in each, we get the $/Sec price of each CPU (see spreadsheet). I could not find a price for the ARM11 (from any manufacturer), so the quoted price in the sheet is for the whole Pi board I used. For the Pipe/Sec price I used 10Gbps quotes for a Cisco WS-C4948-10GE-S Catalyst 4948-10GE 48 Port Switch and an Intel Ethernet Converged Network Adapter X540-T2 PCI Express 2.1 x8 Low Profile. Putting all the numbers into the spreadsheet, we get the following prices. The network price should be read as the price to move the bits, and the CPU prices as the prices to run this particular workload on each of the CPUs.
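For readers without the spreadsheet, here is the same arithmetic end to end. Every dollar figure below is a placeholder of mine, not the spreadsheet's numbers; the point is only to show the mechanics:

    SEC_3Y = 3 * 365 * 24 * 3600

    # Placeholder prices -- assumptions, not the spreadsheet's numbers.
    cs = 1745.0 / (24 * SEC_3Y)       # hypothetical E5-2680 v3 price over 24 hw threads
    ps = (2 * 8000.0 / (48 * SEC_3Y)  # 2 switch ports
          + 2 * 500.0 / (2 * SEC_3Y)) # + 2 host (NIC) ports
    nc = 8.0 / 10                     # 10Gbps Ethernet -> 0.8 Sec/GB

    size_gb = 0.127                   # the 130MB CSV file
    print("cpu price:     $", 1.69 * size_gb * cs)  # GCW X size X CS
    print("network price: $", nc * size_gb * ps)    # NC X size X PS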

To state the obvious: for our toy "grep FRA" example, it is cheaper to put more CPU power near the data than to move the data to the compute.

So what is wrong with the numbers? At the end of the day, other than the CPU, there are many other differences between the systems (RAM type, the storage device where the data is kept, etc.). While some of these differences can be mitigated (e.g. by using a RAM disk), they will always be there. On the other hand, we are not comparing different CPUs; we are comparing a given system with a given CPU to a given network setup.

Some random thoughts

  • Cores and network scale differently: it is much easier to add network ports to a compute node than cores. Also, cores do not 'scale linearly'.

  • The way the network price is calculated is really a ‘lower bound’:

    • Typically, data read from the storage system traverses two network hops: a client-facing layer and a storage layer.

    • The network time was calculated from the network speed and the data size alone, without accounting for, e.g., TCP overhead.

  • The model can easily be extended to take power into account. Take the socket power consumption and divide it by the number of hardware threads; take the switch power consumption and divide it by the number of ports. Multiply the resulting numbers by the electricity price at your favorite location, normalized to seconds (see the first sketch after this list).

  • Any storage system must be connected to the outside world via a network (if only to store the data in the first place). Thus, if we wish to show that it is more economical to place more CPU power in the storage system, it is not enough to show that CPU is cheaper than network (in the sense that the model shows). One needs to consider only the fraction of the network that is used for reading data for the purpose of analysis. In fact, the same holds for CPU: one needs to take into account only the fraction of CPU that is used for processing (on top of what the storage system itself consumes). The model, however, defines GCW as the time it takes to process 1GB worth of data when the CPU is 100% utilized; the numbers in our toy example reflect a much lower CPU utilization (which makes the network look more favorable). Taking the 'fraction' argument into account and considering our toy example, the minimal fraction of the data that needs to be read for processing so that CPU is cheaper is (a reconstruction of this computation is sketched after the list):

    • ~56% for Intel core i5

    • ~28% for Intel E5-2680 V3

    • ~20% for ARM11
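A minimal sketch of the power extension mentioned above (the function names and the $/kWh parameter are mine):

    def core_sec_power_price(socket_watts, hw_threads, usd_per_kwh):
        """Power cost of one hardware-thread-second [$/Sec]."""
        kw_per_thread = socket_watts / hw_threads / 1000.0
        return kw_per_thread * usd_per_kwh / 3600.0  # $/kWh -> $/Sec

    def port_sec_power_price(switch_watts, ports, usd_per_kwh):
        """Power cost of one switch-port-second [$/Sec]."""
        kw_per_port = switch_watts / ports / 1000.0
        return kw_per_port * usd_per_kwh / 3600.0

    # Add these to CS and PS respectively to get power-aware prices.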
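And a sketch of the break-even computation. This is my reconstruction of the 'fraction' argument, not necessarily the spreadsheet's exact arithmetic: if only a fraction f of the network cost is attributable to analysis reads, the CPU side breaks even when GCW X CS = f X NC X PS:

    def break_even_fraction(gcw, cs, nc, ps):
        """Minimal fraction of network use (analysis reads) at which placing
        CPU at the storage side breaks even: solve GCW*CS = f*NC*PS for f."""
        return (gcw * cs) / (nc * ps)

    # With the placeholder prices from the earlier sketch this gives ~0.18
    # for the E5-2680 v3, the same ballpark as the ~28% quoted above.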

Model Motivation

The motivation behind this model is the Storlets chip bakers use case described in the OpenStack Storlets documentation.
