Machine Learning with Storlets
While working together with the IOStack folks on a paper that shows how storlets can be used to boost spark SQL workloads, two of my colleagues, Raul Gracia Tinedo and Yosef Moatti brought the idea of doing machine learning with Storlets. Out of sheer ignorance I was against the idea. I was wrong and in more than one way. Now, that I am a little less ignorant about machine learning, I can say that storlets can be useful for machine learning in several ways which I describe in this blog.
One ‘natural’ way to use storlets for machine learning is for data preparation. Examples range from creating synthetic data based on real data to taking raw data and make it ‘presentable’ to a machine-learning algorithm. In a recent demo that features face recognition deep learning with Storlets, I showed how storlets are used to extract faces from pictures and format them so that they can be fed into a learning algorithm. Specifically, I wrote an extract_face storlet:
Using OpenCV, the storlet finds a bounding box for the face, crops that bounding box and resizes it to a fixed size. The same fixed size is used across all pictures in the training set. Resizing is of particular importance as the learning algorithm ‘anticipates’ that different data points in the training set are of the same ‘format’. Clearly, different pictures feature different sizes of faces. While those operations are fundamental in OpenCV, and there is nothing new about them, embedding those operations in a storlet and invoking the storlet using copy semantics, one can prepare a very large training set without a single bit needing to enter or leave the storage system. Using copy semantics with storlets essentially ‘copies’ object A into B, only instead of having identical copies, B is the storlet’s computation result over A.
Another way to use storlets in the context of machine learning is to apply an already trained model over existing data in the object store. In that same demo I showed how a facial recognition trained model is be used to tag frames in a video with the name of the person appearing in that frame. This was done using a recognize face storlet leveraging OpenCV and scikit-learn as shown below:
The recognize face storlet takes two input objects: (1) A trained model, and (2) a video featuring a person that appeared in the training set. The storlet uses the same procedure described above to extract the face. This is done on each frame of the video, and the extracted face is presented to the trained model. The model returns its classification result. Again, using the copy semantics the storlet can create a tagged version of the video with a name appearing on each frame without any external communication. Another way to use the model would be to use a storlet that would enrich the video with a user defined metadata carrying the name that appeared the maximum number of times across all frames.
Finally, the demo also shows how storlets can be used to train a neural network using a training set:
The storlet takes all the extracted face objects as inputs and outputs a trained model. Each of the extracted faces carries a metadata key with a name from which the storlet builds a supervised learning training set and presents it to a neural network-learning algorithm.
While it is arguable that deep learning is a good fit for storlets as typically it is more compute than data intensive, other types of learning do seem a good fit. One such type of learning is Stochastic Gradient Descent (SGD) or more precisely mini batched SGD. In mini batched SGD the data is being presented to the algorithm in small batches. That is, the algorithm learns iteratively, where in each iteration, a new portion of the training data is being used. This type of algorithm facilitates learning of huge amounts of data that cannot fit at once into memory. When using SGD, special care should be taken to randomly shuffle the data before it is fed into the algorithm as order may make a difference. Also, sometimes one needs to use the same data to the algorithm more than once. mlstorlets is an initial implementation of SGD regression and classification classes using storlets. The repo is based on Python’s scikit-learn library and provides Python classes for regression and classification that can be trained using both local and remote data that reside in Swift. Each class supports the methods:
The fit method is scikit-learn usual method for training the model on a given training set. remote_fit, however, takes an object path in Swift as a parameter. The idea is that once invoked on a remote data object, the state of the model is serialized, and sent as a parameter to a storlet that creates an instance of the model and feeds it with the data specified in the given path. Once the remote learning is done the updated model is serialized back transparently, updating the local instance on which remote_fit was called.
To summarize, doing machine learning takes much more than running a learning algorithm. In fact most of the work in real systems that do machine learning has to do with data preparation in large scales. This is where storlets can probably be of most use. However, storlets can also be of use where the training data set does not fit into memory and batched SGD needs to be used.