
H-Piper

A Deep Learning Framework for Image Pipelines in Halide




Final Write Up

H-Piper: An Image Pipeline Framework in Halide

- Lei Sun, Yang Wu

SUMMARY

H-Piper is our final project for 15418.

We built H-Piper, a flexible framework for deep learning nets in Halide. Users can generate customized image pipelines simply by providing configuration files: given a network definition and weights, H-Piper builds the network and runs a forward pass. With our 10 basic layers we implemented the VGG network and (partially) the Inception network, and tested both on the latedays cluster. H-Piper can be used both to study the best optimization strategies for different nets and as a performance reference.

BACKGROUND

MXNet and Caffe are popular frameworks for image processing pipelines. Both are carefully hand-tuned and deliver outstanding performance, but hand-tuning image pipelines is painful: every change must be re-verified for correctness, and even the simplest reschedule requires touching a large amount of code.

Halide is a domain-specific language designed to make it easier to write high-performance image-processing code on modern machines. It separates the algorithm from a "schedule" that lets developers specify how to iterate over the data. Some schedule examples (sketched in code below): 1. 4x4 tiling; 2. vectorization; 3. x split by a factor of three; 4. tiles in parallel.
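To illustrate, here is roughly what such schedules look like in Halide. This is a minimal standalone sketch, not H-Piper code; f and g are placeholder Funcs.

#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

    // A toy producer-consumer pair standing in for real pipeline stages.
    Func f("f"), g("g");
    f(x, y) = x + y;
    g(x, y) = f(x, y) * 2;

    // 1. 4x4 tiles, 2. vectorize the inner loop, 4. rows of tiles in parallel
    g.tile(x, y, xo, yo, xi, yi, 4, 4)
     .vectorize(xi)
     .parallel(yo);

    // 3. split x by a factor of three on the producer
    f.compute_root().split(x, xo, xi, 3);

    // Same algorithm either way; only the generated loop nest changes.
    g.realize(64, 64);
    return 0;
}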

We implemented 10 layers in H-Piper using Halide. Given a net definition, H-Piper creates the net from this pool of layers and loads the parameters. Each layer's schedule can be defined separately, which lets users explore optimization strategies per layer. New fused layers can also be defined to reduce memory footprint; for instance, a 'convpool' layer could fuse a max/average pool layer with the convolutional layer that feeds it (see the sketch below). H-Piper is flexible and easy to use.
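A sketch of what such a fusion could look like in Halide, with toy Funcs standing in for the conv and pool layers (not the actual H-Piper layer code):

#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

    // Toy stand-ins for a conv layer's output and the max pool that follows it.
    Func conv("conv"), pool("pool");
    conv(x, y) = x * y;                                    // placeholder "conv" result
    RDom r(0, 2, 0, 2);                                    // 2x2 pooling window
    pool(x, y) = maximum(conv(2 * x + r.x, 2 * y + r.y));  // max pool, stride 2

    // Fuse: compute conv per 8x8 tile of pool's output instead of
    // materializing the whole conv result, shrinking the memory footprint.
    pool.tile(x, y, xo, yo, xi, yi, 8, 8).parallel(yo);
    conv.compute_at(pool, xo);

    pool.realize(32, 32);
    return 0;
}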

[Figure: H-Piper structure]

Hand-tuning a Halide schedule is much easier than hand-tuning Python or C++. However, it is still painful and tedious when we need to account for both parallelism and locality. We encountered an interesting paper [2] by Ravi Teja Mullapudi that introduces PolyMage, a domain-specific language and compiler for image processing pipelines. PolyMage can automatically generate approximately optimal schedules for Halide programs. Below is a result where PolyMage competes with the Halide experts from Google.

[Figure: PolyMage vs. Google's hand-tuned Halide schedules]

PolyMage is another motivation for building a framework in Halide. Google's Inception Net has 156 layers; hand-tuning it is the last thing we want to do, but with PolyMage we could potentially get an approximately optimal schedule in a few minutes. To integrate with PolyMage's automatic scheduler, every layer must report its output size in each dimension. Halide::Func does not carry that information, so we extended our layers to expose it through the out_dims and out_dim_size methods described in the next section.

APPROACH

Since most image pipelines can be expressed as a combination of layers, we implemented ten basic layers: data_layer, avg_pool_layer, concat_layer, conv_layer, flat_layer, fully_conn_layer, lrn_layer, max_pool_layer, relu_layer, softmax_layer.

Each layer has a similar interface:

// `forward` is the Halide Func holding this layer's output,
// indexed as forward(x, y, z, n)
Halide::Func forward;

// construct the layer, feeding it the output of another layer
Layer(int... params, Layer *input);

// number of dimensions in this layer's output
int out_dims();

// size of the output in dimension i
int out_dim_size(int i);
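
As a concrete illustration, a max-pool layer written against this interface might look roughly like the sketch below. The base class, member names, and constructor parameters are our assumptions here, not the actual H-Piper source.

// Sketch only: assumes Layer stores its input pointer and declares
// Halide::Var x, y, z, n and Halide::Func forward.
class MaxPoolLayer : public Layer {
public:
    int pool, stride;

    MaxPoolLayer(int pool, int stride, Layer *input)
        : Layer(input), pool(pool), stride(stride) {
        // forward(x, y, z, n): maximum over a pool x pool window of the input
        Halide::RDom r(0, pool, 0, pool);
        forward(x, y, z, n) = Halide::maximum(
            input->forward(x * stride + r.x, y * stride + r.y, z, n));
    }

    // Dimension metadata needed by an automatic scheduler such as PolyMage.
    int out_dims() { return 4; }
    int out_dim_size(int i) {
        if (i == 0 || i == 1)   // spatial dims shrink with the window and stride
            return (input->out_dim_size(i) - pool) / stride + 1;
        return input->out_dim_size(i);   // channels and batch size unchanged
    }
};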

For example, to build a network that takes an image as input and then applies a convolution, we would do this in H-Piper:

// generate a data layer from image
Halide::Image<float> data = load_image();
DataLayer* data_layer = new DataLayer(h, w, ch, n, data);

// feed the data layer into a convolutional layer (conv parameters elided)
Convolutional* conv_layer = new Convolutional(..., data_layer);
...

With H-Piper's basic layers, most image pipelines become a matter of assembling building blocks. As long as the input-output mapping and parameters are defined correctly, H-Piper does the heavy lifting for you; a fuller sketch follows.
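Continuing the example above, the rest of a small pipeline could be chained and run like this. Constructor signatures and the ReLU/MaxPool class names are illustrative; the last layer's forward Func stands for the whole pipeline, and realizing it runs the forward pass end to end.

// conv -> relu -> maxpool, then realize the final Func to run forward
ReLU*    relu_layer = new ReLU(conv_layer);
MaxPool* pool_layer = new MaxPool(2 /*window*/, 2 /*stride*/, relu_layer);

Halide::Image<float> out = pool_layer->forward.realize(pool_layer->out_dim_size(0),
                                                       pool_layer->out_dim_size(1),
                                                       pool_layer->out_dim_size(2),
                                                       pool_layer->out_dim_size(3));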

During development we also accidentally triggered an assert in Halide. Before we fixed the underlying bug, our workaround was to split any kernel filter that exceeded the buffer limit into two kernels along the 4th dimension and concatenate the results after computation. Although the assert was triggered by a dimension error in our implementation, handling it ended up letting us deal with differently sized filters more flexibly.

RESULTS

First we ran a complete VGG network with H-Piper and explored several schedule strategies across the layers of the network (a code sketch follows the list):
- 1. no schedule
- 2. parallel softmax layer
- 3. parallel maxpool layer
- 4. parallel conv layer
- 5. reorder conv layer
- 6. reorder + parallel conv layer
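
Each of these amounts to a one- or two-line change on the corresponding layer's forward Func. Roughly (the layer pointers and chosen loop orders here are illustrative):

// 3. parallelize the max-pool layer across rows
max_pool->forward.parallel(y);

// 4. parallelize the conv layer across output channels
conv->forward.parallel(z);

// 5. reorder the conv loops to change the traversal order
conv->forward.reorder(x, z, y, n);

// 6. reorder + parallel on the conv layer
conv->forward.reorder(x, z, y, n).parallel(n);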

Here are our results; the x-axis is the schedule tried and the y-axis is the running time:
[Chart: running time of each schedule on VGG]

We have a few observations from the chart:
- Most of the time in a Halide pipeline is spent in computation rather than in network creation. In the creation phase no actual work is done; we only define the inputs and outputs of each Halide::Func. This is also why fusion comes cheaply in Halide.
- The convolution layers dominate VGG's performance: rescheduling other layer types does not affect overall performance nearly as much as rescheduling the convolution layers.
- HAND-TUNING IS PAINFUL. A 'sophisticated' strategy does not guarantee better performance; in the chart, the purely parallel schedules actually perform worse. There is also no one-size-fits-all schedule for layers. For example, parallelizing over channels seems reasonable, but for a fully-connected layer that works across channels it is no longer practical. In short, the schedule must be chosen based on filter size, input size, and computation type.

We have also implemented all the layers needed for Google's Inception Net; our framework can take a protobuf and create an Inception Net from it. Hand-tuning this net would be unreasonable, so we plan to use PolyMage to measure the achievable speedup once we get access to it.

REFERENCES

[1] C. Szegedy, W. Liu, Y. Jia, et al., 2015. Going Deeper with Convolutions.

[2] R. T. Mullapudi, V. Vasista, and U. Bondhugula, 2015. PolyMage: Automatic Optimization for Image Processing Pipelines.

[3] K. Simonyan and A. Zisserman, 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR.





Proposal

- Lei Sun, Yang Wu

SUMMARY

H-Piper is our final project for 15418.

We are going to build a powerful, flexible, and generic framework for deep learning nets in Halide, named H-Piper. We will embed popular neural networks such as VGG, CNNs, and Google's Inception Net in the framework, and we will let users define their own networks by providing simple configuration files. We will extend our input processor to support popular formats such as protobuf and JSON.

If time permits, we would like to hand-tune our framework's schedules (a Halide feature) and compete with Caffe and MXNet. We could also leverage the existing Halide auto-scheduler developed by Ravi to generate schedules automatically and compete with Caffe and MXNet.

Our framework will be useful for testing the Halide auto-scheduler's performance on different neural nets, and it will be a good performance reference for users interested in achieving higher performance.

BACKGROUND

MXNet and Caffe are among the most popular frameworks for deep learning. Both are implemented in C++ and carefully hand-tuned. Hand-tuning in C++ can be painful: every change must be re-verified for correctness, and even the simplest reschedule requires a large amount of changes.

Halide is a programming language designed to make it easier to write high-performance image processing code on modern machines. It provides a concept of "scheduling" that allows developers to easily define how to iterate through the data. The amount of code to change is small, and correctness is unaffected when only the schedule changes.

It is always interesting to explore the tradeoff between parallelism and locality. With Halide, developers can explore that tradeoff much faster and with far less frustration.

PolyMage focuses on automatically generating high-performance implementations of image processing pipelines expressed in a high-level declarative language. "Experimental results on a modern multicore system show that the performance achieved by our automatic approach is up to 1.81× better than that achieved through manual tuning in Halide, a state-of-the-art language and compiler for image processing pipelines." [2]

THE CHALLENGE

  1. We need to figure out the scope of our framework. We would like it to be general enough to support everything, but given that we only have a couple of weeks to build it, it might not end up as general as Caffe or MXNet.
  2. We need to explore locality between layers in a neural network: check whether two layers can be fused together and compare the resulting performance.
  3. We need to consume different types of inputs and provide APIs that make it easy for users to migrate to our framework.
  4. The primitives of our framework need to be carefully designed so that users can easily define a pipeline or a customized neural net.
  5. It may be difficult to get automatic schedules from PolyMage, since we do not control it. In the end, we may have to live with our hand-tuned performance.

WORKLOAD

We think the cache footprint of convolution will be quite large. Although the convolution step has no strong dependencies, so locality will not be hurt too much, there should still be optimizations in how the memory footprint is traversed, and Halide makes it easy to define how we want to handle this workload (a sketch of the computation follows).
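To make the footprint concrete: each output point of a convolution reduces over a K x K window of every input channel, so neighboring outputs re-read heavily overlapping data. A standalone Halide sketch of that computation (sizes and the synthetic input/weights are placeholders):

#include "Halide.h"
using namespace Halide;

int main() {
    const int K = 3, C_IN = 64, C_OUT = 64, W = 64, H = 64, N = 1;

    // Synthetic activations and weights; in practice these come from the
    // data layer and the trained model.
    Var x("x"), y("y"), z("z"), n("n"), ci("ci");
    Func input("input"), weights("weights");
    input(x, y, ci, n) = cast<float>(x + y + ci);
    weights(x, y, ci, z) = 1.0f / (K * K * C_IN);

    // Each output point reduces over a K x K x C_IN region of the input,
    // so nearby outputs share most of their reads; that overlap is the
    // locality a schedule needs to exploit.
    RDom r(0, K, 0, K, 0, C_IN);
    Func conv("conv");
    conv(x, y, z, n) = 0.0f;
    conv(x, y, z, n) += weights(r.x, r.y, r.z, z) *
                        input(x + r.x, y + r.y, r.z, n);

    conv.realize(W - K + 1, H - K + 1, C_OUT, N);
    return 0;
}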

Backpropagation also requires intensive computation and intermediate differential results for the chain rule.

CONSTRAINTS

The forward and backward passes might require a larger memory footprint than a single machine in the cluster has. We need to find a decomposition method to reduce that effect.

Also, if we pad the matrices to keep the data from shrinking too quickly, the workload can vary across threads or even machines. We have to find a scheduling strategy that balances the load.

RESOURCES

We are going to use the GHC machines and start from scratch in Halide. We will implement Google's Inception Net first. There are existing implementations in Caffe and Python, and we will try to compete with their performance.

This Git repository has some helpful information about how to implement this net.

More references will be added as we find useful papers and implementations.

GOALS AND DELIVERABLES


PLAN TO ACHIEVE
1. A Working Google Inception Net in Halide
2. A working deep learning framework in Halide that supports CNNs, Inception Net, and VGG.

HOPE TO ACHIEVE

  1. Automatic schedules generated by PolyMage, and the speedup observed over our hand-tuned version.
  2. Performance competitive with Caffe and MXNet on Inception Net, VGG, CNNs, etc.

PLATFORM CHOICE

We choose the latedays cluster as our platform and Halide as our language.

As for the Halide language: first, it is a DSL for image processing, which is an advantage in this project since we are working with image datasets. Second, the tuning procedure is easier and more efficient in Halide than in languages like C++; by changing only the schedule, we can try many more configurations in a short time. As Professor Kayvon has shown, it is usually faster to find good settings in Halide.

SCHEDULE

Timeline | Goal to achieve
April 8 | Understand Google Inception Net's layers and how they work
April 15 | Sketch out the framework and define primitives
April 22 | Implement the framework in Halide and test it on the CIFAR-10 dataset
April 29 | Tune different schedules to improve performance, contact the PolyMage authors for automatic schedules, and compete with Caffe and MXNet
May 7 | Wrap up the implementation and compare/analyse the performance in the report







Checkpoint

Progress review

We finished the literature review for the project in the first two weeks after submitting the proposal. After learning that Google's Inception Net contains 8 unique layer types that can also be used to construct CNNs, VGG, and other nets, we decided to focus on implementing the Inception Net, which essentially means implementing those 8 unique layers.

We are currently stuck building Caffe2 on the latedays cluster; many dependencies are missing. While waiting for the instructors to help, we decided to move on and either build our own libraries or proceed with the Halide implementation first.

SCHEDULE

Timeline | Goal to achieve | Progress
April 8 | Understand Google Inception Net's layers and how they work | Done
April 15 | Sketch out the framework and define primitives | Done
April 22 | Implement the framework in Halide and test it on the CIFAR-10 dataset | Ongoing
April 29 | Tune different schedules to improve performance, contact the PolyMage authors for automatic schedules, and compete with Caffe and MXNet | Not started
May 7 | Wrap up the implementation and compare/analyse the performance in the report | Not started

Updated Goal and Deliverables

We have put many hours into building Caffe and Caffe2 on the latedays cluster. We decided to narrow our scope to supporting only protobuf and JSON input; we could expand the supported inputs in the future, but for now the ability to take in protobuf and JSON is the most important.

We also decided to focus on implementing the H-Piper framework rather than on tuning schedules. Having the framework ready helps investigate the locality-parallelism tradeoff, and PolyMage could benefit from our framework by testing its automatically generated schedules on different nets.

The performance comparison will be between Caffe's performance and the performance of the PolyMage-scheduled version.

What to Show in Parallelism Competition

We want to show a comparison of the performance of our hand-tuned schedule, the PolyMage automatic schedule, and Caffe2, so it will be a graph. It would be nice to also demo how easy it is to configure a net, but given that we only have 6 minutes, we might not do the demo.

Preliminary Results

We do not have any results yet; we are still working on the correctness of our framework.

Issues

New Schedule with Tasks

Timeline | People | Task
April 22 | Lei Sun | Maxpool layer, softmax layer implementation
April 22 | Yang Wu | Protobuf input parser and data layer
April 29 | Lei Sun | Test case construction; Caffe2 dependencies
April 29 | Yang Wu | Conv layer, DepthConcat layer, net creation
May 7 | Lei Sun, Yang Wu | Wrap up the implementation and compare/analyse the performance in the report