## Blueoil key code points walkthrough

Joel Nicholls, Kalika Suksomboon, and Atsunori Kanemura

The purpose of this document is to introduce the software (TensorFlow) level of the Blueoil code. I have in mind especially people who are reading through the TensorFlow code of Blueoil for the first time; hopefully this document can be a friendly help for them.

### Learning Objectives

The target audience is readers who would like to understand what happens in Blueoil at the TensorFlow level of code abstraction. There is also some introduction to TensorFlow, in case the reader is not so familiar with that framework.

After reading through this document, my hope is that the reader will be able to do things such as modify the hyperparameters, change the quantization type, or implement new networks in the Blueoil framework.

- TensorFlow: Blueoil uses TensorFlow (and Python) to implement neural network training, so we first go over some of the main components of a TensorFlow program.
- Blueoil key code points: We go through some of the key files and code lines in Blueoil that make the training run.
- Blueoil Quantization key code points: We look at where the quantization parameters are passed, and the functions that implement quantization.
- How to run experiments: Finally, a quick note on how to run experiments.

### TensorFlow

Overall, we are using TensorFlow to implement neural networks. TensorFlow is a framework for Python, which has a lot of the general neural network functionality built-in, so you don’t need to code everything from scratch.

Because Blueoil is written using TensorFlow, we first have a quick overview of TensorFlow. The introduction page on the TensorFlow website shows the basic components of the low-level API.

Maybe the two most important components are the graph and the session. The graph is how TensorFlow represents operations between tensors and variables. It tells TensorFlow how things will run once you start things off. The session gives actual values to the graph. Without the session, the graph is like an empty shell of nodes and links between them.

Generally, your TensorFlow model will be defined by tensors and the operations between them. There are different kinds of tensors, for example variables, which change when your model is optimized. Another important kind of tensor is the placeholder. Placeholders are special in that you can feed values directly into them, whereas the other tensors in a TensorFlow model are usually closed off to you. TensorFlow is built this way so that it can run the graph efficiently. Placeholders are what you use to pass labels and images (or other data) into the TensorFlow graph.

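The session itself will be run by something like the code below:

    import tensorflow as tf  # TensorFlow 1.x API

    sess = tf.Session()
    # Placeholders are the entry points for feeding data into the graph.
    x = tf.placeholder(tf.float32)
    y = tf.placeholder(tf.float32)
    z = x + y  # adds an operation to the default graph
    print(sess.run(z, feed_dict={x: 3, y: 4.5}))  # prints 7.5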

x, y, and z are all tensors, and sess is a newly created session. The session always ‘makes things happen’ within a graph. So which graph are we talking about here? It is the default graph, unless you tell it otherwise. z = x + y defines an operation in the graph. Once this is all set up, calling sess.run runs the session to calculate z. The feed_dict argument is used to pass in values for the placeholders. Note that the session will only run the part of the graph that is necessary to get z. In this case that is the whole graph, but that won’t always be true. This is an example of how the graph is closed off to the user, in some sense, in order to make the computation more efficient.
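
To make the point about running only the necessary part of the graph more concrete, here is a toy evaluator in plain Python. It is not how TensorFlow is implemented; the Node class and the run function are hypothetical names used only for this illustration.

```python
# Toy illustration only: Node and run are hypothetical, not TensorFlow APIs.
class Node:
    """A graph node: a placeholder (no op) or an operation on input nodes."""

    def __init__(self, name, op=None, inputs=()):
        self.name = name
        self.op = op
        self.inputs = inputs


def run(fetch, feed_dict, visited=None):
    """Evaluate `fetch`, visiting only the nodes it depends on."""
    if visited is None:
        visited = []
    if fetch.op is None:
        # Placeholders take their value from feed_dict.
        value = feed_dict[fetch]
    else:
        args = [run(node, feed_dict, visited)[0] for node in fetch.inputs]
        value = fetch.op(*args)
    visited.append(fetch.name)
    return value, visited


x = Node("x")
y = Node("y")
z = Node("z", op=lambda a, b: a + b, inputs=(x, y))
w = Node("w", op=lambda a, b: a * b, inputs=(x, y))  # never needed for z

result, visited = run(z, {x: 3, y: 4.5})
print(result)   # 7.5
print(visited)  # ['x', 'y', 'z']
```

The node w is part of the graph but is never evaluated, because z does not depend on it.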

Another important point is that when you have a graph with variables (not placeholders), you will need to initialize those variables before trying to calculate anything. In other words, it doesn’t make sense to run the graph using values that are not initialized.

The main thing that makes TensorFlow useful is that it can perform backpropagation on the graph, in order to minimize some loss function. Generally, you will feed a batch of data into the model, the model spits out some predictions, and the loss is a function that measures the distance between what the model predicted and the ground-truth labels. TensorFlow has the most common kinds of loss functions built-in.

Even after the gradient of the loss function with respect to all the variables has been calculated, you still need to tell TensorFlow how you’d like it to update the variables. There are many different update rules, called optimizers. For example:

    optimizer = tf.train.GradientDescentOptimizer(0.01)
    train = optimizer.minimize(loss)

The optimizer’s minimize method defines a train operation, which updates the variables of the graph when you run it. A typical training run consists of many steps, each using a batch of inputs and their labels to update the model. In this example, 0.01 is the learning rate, i.e. the size of the step taken at each update.
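
As a rough picture of what running the train operation does at each step, the sketch below applies plain gradient descent by hand to a one-variable quadratic loss. The loss function, the starting value, and the number of steps are made up for illustration; only the step size 0.01 comes from the example above.

```python
def loss(w):
    """Illustrative loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad(w):
    """Derivative of the loss with respect to w."""
    return 2.0 * (w - 3.0)


learning_rate = 0.01
w = 0.0  # initial value of the single variable
for step in range(1000):
    # One gradient descent update: move against the gradient.
    w = w - learning_rate * grad(w)

print(round(w, 3))  # 3.0
```

In TensorFlow the gradient is obtained by backpropagation through the graph rather than written by hand, but the update applied by the gradient descent optimizer has this form.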

A quick point about TensorFlow: most of the operations in a neural network can be described as matrix operations, and GPUs are good at matrix operations. It is therefore a good idea to train neural networks on a GPU, and TensorFlow is set up to make use of GPUs.

### Blueoil key code points

Here we’ll see some of the key points in the Blueoil code that correspond to the TensorFlow concepts that we just went over. In this way, I think the general method of how Blueoil works can be understood.

#### The main training file

In this file, 1) the session is created, 2) there are some optional parts (such as loading a pretrained network), 3) the main training loop is run, and 4) test evaluation is also run.

The session is created and initialized:

    sess = tf.Session(graph=graph, config=session_config)
    sess.run([init_op, reset_metrics_op])

The main training loop:

    for step in range(last_step, max_steps):
        print("step", step)

Running the train step:

    sess.run([train_op], feed_dict=feed_dict)

Same as in the TensorFlow explanation, the feed_dict contains the batch of images and their labels. Of course, the model itself is also used in this file. It is passed in from the model definition file.

#### Model definition file

This file defines the model to be used. There are several such files, depending on what kind of problem you want to solve; this one is for classification. The main part can be seen below. There is a sequence of convolutions, followed by a final convolution layer (whose number of output channels is the number of classes), then global average pooling.

    x = _lmnet_block('conv1', images, 32, 3)
    x = _lmnet_block('conv2', x, 64, 3)
    x = self._space_to_depth(name='pool2', inputs=x)
    x = _lmnet_block('conv3', x, 128, 3)
    x = _lmnet_block('conv4', x, 64, 3)
    x = self._space_to_depth(name='pool4', inputs=x)
    x = _lmnet_block('conv5', x, 128, 3)
    x = self._space_to_depth(name='pool5', inputs=x)
    x = _lmnet_block('conv6', x, 64, 1, activation=tf.nn.relu)

    x = tf.layers.dropout(x, training=is_training)

    kernel_initializer = tf.random_normal_initializer(mean=0.0, stddev=0.01)
    x = tf.layers.conv2d(name='conv7',
                         inputs=x,
                         filters=self.num_classes,
                         kernel_size=1,
                         kernel_initializer=kernel_initializer,
                         activation=None,
                         use_bias=True,
                         data_format=channels_data_format)

    self._heatmap_layer = x

    h = x.get_shape()[1].value if self.data_format == 'NHWC' else x.get_shape()[2].value
    w = x.get_shape()[2].value if self.data_format == 'NHWC' else x.get_shape()[3].value
    x = tf.layers.average_pooling2d(name='pool7',
                                    inputs=x,
                                    pool_size=[h, w],
                                    strides=1,
                                    data_format=channels_data_format)

    self.base_output = tf.reshape(x, [-1, self.num_classes], name='pool7_reshape')
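
To get a feel for how the tensor shape changes through this network, the sketch below traces one image through the layers in plain Python. Several things here are assumptions not shown in the excerpt: the convolution blocks are taken to preserve height and width (stride 1 with 'SAME' padding), space-to-depth is taken to use a block size of 2, and the input is taken to be a 32x32x3 image with 100 classes (as for CIFAR-100). The helper names conv and space_to_depth are hypothetical.

```python
def conv(shape, filters):
    """Stride-1 'SAME' convolution (assumed): only the channel count changes."""
    h, w, _ = shape
    return (h, w, filters)


def space_to_depth(shape, block_size=2):
    """Move block_size x block_size spatial blocks into the channel axis."""
    h, w, c = shape
    return (h // block_size, w // block_size, c * block_size ** 2)


num_classes = 100          # assumed: CIFAR-100
shape = (32, 32, 3)        # assumed input size (height, width, channels)

shape = conv(shape, 32)            # conv1
shape = conv(shape, 64)            # conv2
shape = space_to_depth(shape)      # pool2
shape = conv(shape, 128)           # conv3
shape = conv(shape, 64)            # conv4
shape = space_to_depth(shape)      # pool4
shape = conv(shape, 128)           # conv5
shape = space_to_depth(shape)      # pool5
shape = conv(shape, 64)            # conv6
shape = conv(shape, num_classes)   # conv7
heatmap_shape = shape
# Global average pooling collapses height and width, leaving one value per class.
output_shape = (shape[2],)

print(heatmap_shape)  # (4, 4, 100)
print(output_shape)   # (100,)
```

Under these assumptions, each space-to-depth step halves the spatial size and quadruples the channel count, and the final pooling plus reshape leaves one score per class for each image in the batch.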

There are various different hyperparameters that are used in the train file and the model file. These hyperparameters come from the config file.

#### Config file

The config file has various hyperparameters related to the model and training. For example, the learning rate schedule:

    NETWORK.LEARNING_RATE_KWARGS = {
        "values": [0.01, 0.001, 0.0001, 0.00001],
        "boundaries": [step_per_epoch * 200, step_per_epoch * 300, step_per_epoch * 350],
    }
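
The sketch below shows how such values and boundaries could translate into a learning rate at a given step. It assumes they are consumed by a piecewise-constant schedule (in the style of TensorFlow's tf.train.piecewise_constant), which the excerpt itself does not show. The piecewise_constant function here is a plain-Python stand-in, and step_per_epoch = 100 is a made-up value.

```python
import bisect


def piecewise_constant(step, boundaries, values):
    """Return the value for the interval that `step` falls into."""
    # bisect_left keeps a step equal to a boundary in the earlier interval.
    return values[bisect.bisect_left(boundaries, step)]


step_per_epoch = 100  # hypothetical; the real value depends on dataset and batch size
values = [0.01, 0.001, 0.0001, 0.00001]
boundaries = [step_per_epoch * 200, step_per_epoch * 300, step_per_epoch * 350]

for epoch in (0, 200, 250, 320, 400):
    step = epoch * step_per_epoch
    print(epoch, piecewise_constant(step, boundaries, values))
```

With this reading, the learning rate starts at 0.01 and is divided by 10 after epochs 200, 300, and 350.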

There are also hyperparameters for which dataset and which model to use:

    NETWORK_CLASS = LmnetQuantize
    DATASET_CLASS = Cifar100

The specific config file I linked to is for quantized training of the lmnet_v1 model on the CIFAR100 dataset, with a specific set of hyperparameters for how to train it. In the higher-level code of Blueoil, the config file is generated through a question-and-answer session with the user. This is much more user-friendly: it is not necessary to go through config files and make manual tweaks each time. Instead, the higher-level code can be used to easily generate a new config file for the specific use case.

### Blueoil Quantization key code points

Finally, we get to the parts of the code specific to quantized training. In the same model definition file as mentioned earlier, there is a child class of the main model definition, which is called (in this case) LmnetV1Quantize and inherits from LmnetV1. It is this child class that sets up the activation and weight quantization of the network.

    self.activation = activation_quantizer(**activation_quantizer_kwargs)
    weight_quantization = weight_quantizer(**weight_quantizer_kwargs)
    self.custom_getter = functools.partial(self._quantized_variable_getter,
                                           weight_quantization=weight_quantization)

The quantized variable getter says which variables to quantize. The self.activation will be passed to the main network for quantizing the activations, and the self.custom_getter will be passed to the main network for quantizing the weights. To be specific, these two things are used in the block definition file.
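
The custom getter pattern can be sketched in plain Python as follows. This is an illustration under assumptions, not the Blueoil implementation: the rule that only variables whose name ends in "kernel" are quantized is assumed for the example, and the variables dictionary and sign_quantizer are hypothetical stand-ins for the TensorFlow variable store and the real weight quantizer.

```python
import functools


def quantized_variable_getter(getter, name, weight_quantization=None):
    """Fetch a variable, quantizing it if it looks like a layer's weights."""
    value = getter(name)
    # Assumed rule for this sketch: only variables named ".../kernel" are weights.
    if weight_quantization is not None and name.split("/")[-1] == "kernel":
        return weight_quantization(value)
    return value


# Stand-ins for the real variable store and weight quantizer (both hypothetical).
variables = {"conv1/kernel": [0.7, -0.2, 0.1], "conv1/bias": [0.05]}


def sign_quantizer(values):
    """Map every weight to +1.0 or -1.0."""
    return [1.0 if v >= 0 else -1.0 for v in values]


# functools.partial fixes the quantizer, leaving the (getter, name) signature.
custom_getter = functools.partial(quantized_variable_getter,
                                  weight_quantization=sign_quantizer)

print(custom_getter(variables.get, "conv1/kernel"))  # [1.0, -1.0, 1.0]
print(custom_getter(variables.get, "conv1/bias"))    # [0.05]
```

The point of the pattern is that the layer code asks for its variables in the usual way, and the getter decides transparently whether it receives the raw or the quantized version.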

#### Block definition file

This is a template file that is to be used for all quantized layers. In other words, if you add a layer to the main network without using the block template, it will not be a quantized layer.

    with tf.variable_scope(name, custom_getter=custom_getter):

This is the line that causes the weight quantization to be used. The custom getter was defined to quantize the weights. Therefore, creating the layer inside a variable scope with this custom getter means the weights are retrieved in quantized form.

    if activation:
        output = activation(biased)

These are the two lines that cause the activations to be quantized. The activation that gets passed to this file has custom-defined forward and backward properties, similar to the weight quantizer.

#### Activation quantizer definition file

This file contains the specific definition of the activation quantizer. This definition gets passed into the lower part of the model definition file (mentioned previously), and is named in the config file.

    @Defun(dtype, tf.int32, tf.float32, python_grad_func=_backward,
           shape_func=lambda op: [op.inputs[0].get_shape()],
           )  # the excerpt omits one argument line here
    def _func(x, bit, max_value):
        # min_value is not an argument; it comes from the enclosing scope.
        n = tf.pow(2., tf.cast(bit, dtype=tf.float32)) - 1
        value_range = max_value - min_value

        x = tf.clip_by_value(x, min_value, max_value, name="clip")
        shifted = (x - min_value) / value_range
        quantized = tf.round(shifted * n) / n
        unshifted = quantized * value_range + min_value
        return unshifted

This function is used as the custom forward operation of the activation quantizer, as a TensorFlow operation. There is a custom definition for the backward pass, too. The clip_by_value, round, etc. make up the actual computation of the forward operation, which performs mid-tread uniform quantization.
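
A worked example may help here. The sketch below re-implements the forward computation for a single scalar in plain Python, assuming min_value = 0 (the excerpt does not show where min_value is set). The backward function is only an assumption about what such a custom gradient commonly looks like (a straight-through estimator that passes the gradient inside the clipping range); the source does not show its backward definition. Both function names are hypothetical.

```python
def quantize_activation(x, bit, max_value, min_value=0.0):
    """Clip x to [min_value, max_value] and snap it to one of 2**bit levels."""
    n = 2.0 ** bit - 1                      # number of steps between levels
    value_range = max_value - min_value
    clipped = min(max(x, min_value), max_value)
    shifted = (clipped - min_value) / value_range
    # Python's round() and tf.round both round halves to the nearest even value.
    quantized = round(shifted * n) / n
    return quantized * value_range + min_value


def straight_through_grad(x, upstream_grad, max_value, min_value=0.0):
    """Assumed backward pass: pass the gradient inside the clip range, else 0."""
    return upstream_grad if min_value <= x <= max_value else 0.0


# With bit=2 and max_value=2.0 the four levels are 0, 2/3, 4/3 and 2.
inputs = [-1.0, 0.2, 0.9, 1.5, 3.0]
print([round(quantize_activation(v, 2, 2.0), 4) for v in inputs])
# [0.0, 0.0, 0.6667, 1.3333, 2.0]
```

Values below the range are clipped to the lowest level, values above it to the highest level, and everything in between is rounded to the nearest of the evenly spaced levels.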

#### Weight quantizer definition file

This file is similar in spirit to the activation quantizer file. Here, a custom forward and backward are defined for the weight quantizer function. The main difference is that this will be used as a custom getter for the weights, rather than as a clearly separate operation like the activation quantizer.

Also, thinking in terms of the hardware implementation, this operation will not be used at inference time. Instead, all the weights will be permanently converted to quantized values for inference. However, at the TensorFlow level, the weight quantization operation is always used, even during inference. Another point is that even the quantized weights are fake-quantized at the TensorFlow level, because they are still stored in a high-bitwidth format.

    # x kernel shape is [height, width, in_channels, out_channels]
    scaling_factor = tf.reduce_mean(tf.abs(x), axis=[0, 1, 2])
    # TODO(wakisaka): tensorflow raise error.
    # tf.summary.histogram("scaling_factor", scaling_factor)
    quantized = tf.sign(x) * scaling_factor

The code above shows the channelwise quantizer. There is also a layerwise quantizer in the same file. The difference between these two types of weight quantizer is that the channelwise quantizer has one scaling factor per output channel, while the layerwise quantizer has one scaling factor per layer.
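
The difference between the two can be seen on a tiny example. The sketch below uses plain Python lists and, for brevity, a kernel already flattened to shape [height * width * in_channels, out_channels], so that averaging over rows corresponds to averaging over axes 0, 1, and 2 of the real kernel. The function names are hypothetical, and the layerwise version (the sign times the mean absolute value over the whole layer) is an assumption based on the description above rather than a copy of the Blueoil code.

```python
def sign(v):
    """Return -1, 0 or 1, like tf.sign."""
    return (v > 0) - (v < 0)


def channelwise_binary(kernel):
    """One scaling factor per output channel (column)."""
    n_rows, n_out = len(kernel), len(kernel[0])
    scales = [sum(abs(row[c]) for row in kernel) / n_rows for c in range(n_out)]
    return [[sign(row[c]) * scales[c] for c in range(n_out)] for row in kernel]


def layerwise_binary(kernel):
    """One scaling factor for the whole layer (assumed form)."""
    magnitudes = [abs(v) for row in kernel for v in row]
    scale = sum(magnitudes) / len(magnitudes)
    return [[sign(v) * scale for v in row] for row in kernel]


# Two rows (flattened height/width/in_channels) and two output channels.
kernel = [[0.5, -2.0],
          [-1.5, 4.0]]

print(channelwise_binary(kernel))  # [[1.0, -3.0], [-1.0, 3.0]]
print(layerwise_binary(kernel))    # [[2.0, -2.0], [-2.0, 2.0]]
```

The channelwise version keeps the different magnitudes of the two output channels (1.0 and 3.0), while the layerwise version uses the single value 2.0 everywhere.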

### How to run experiments

Blueoil allows for easy creation of models for custom datasets. If you are trying to write your own config file, or testing out new networks, you can do so from the (lower-level) LMnet directory of Blueoil.