< Tensorflow > How does TensorFlow do Quant Aware Training?

Let's first simplify the quantization process in TensorFlow.

Overview

S_a1(q_a1 - Z_a1) = S_w1(q_w1 - Z_w1) * S_a0(q_a0 - Z_a0)
  • q_a1: quantized activation value in layer 1

  • S_a1, Z_a1: estimated scale and zero point of the activations in layer 1

  • q_w1: quantized weight in layer 1

  • S_w1, Z_w1: scale and zero point computed from the weight statistics in layer 1

  • q_a0: quantized activation value in layer 0

  • S_a0, Z_a0: estimated scale and zero point of the activations in layer 0

As we can see, in order to compute q_a1 (the quantized activation value in layer 1), we need S_w1, Z_w1, S_a0, Z_a0, S_a1 and Z_a1. Getting S_w1/Z_w1 is simple: they can be computed directly from the statistical range (min/max) of the weights in each layer. The tricky part is how to get S_a1/Z_a1 and S_a0/Z_a0, which have to be estimated from the training data.
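As a concrete illustration, here is a minimal sketch (my own helper, not TensorFlow internals) of how an asymmetric 8-bit scale/zero-point pair can be derived from an observed min/max range:

def scale_and_zero_point(x_min, x_max, num_bits=8):
    # Map the float range [x_min, x_max] onto the integer range [0, 2^num_bits - 1]
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

# e.g. activations observed in [-2.0, 6.0] during training
print(scale_and_zero_point(-2.0, 6.0))  # -> (~0.0314, 64)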

Why do we have to estimate S/Z after the activation instead of before it?

Of course we could estimate S/Z before the activation and then apply the activation (ReLU) to the quantized values. However, there is one drawback:

Estimate S/Z before activation

conv output [-A, A] -> estimate S/Z over [-A, A] -> quantized range [0, 255] -> truncation (relu) -> only [Z, 255] is actually used (relu6: [Z, X])

Estimate S/Z after activation

conv output [-A, A] -> truncation (relu) -> [0, A] -> estimate S/Z over [0, A] -> the full quantized range [0, 255] is used (relu6: [0, 255] as well)

As we can see, when S/Z are estimated after the activation, the activation values always map onto the full 0 to 255 range. When S/Z are estimated before the activation, the ReLU truncation throws away every quantized value below the zero point, so only part of the 0 to 255 range is actually used. In other words, estimating S/Z before the activation compresses the quantized activations into a narrower range, which can lead to accuracy loss.
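A quick numeric illustration (made-up ranges, reusing the scale_and_zero_point helper sketched above):

# Hypothetical conv output range, symmetric around zero
s_pre, z_pre = scale_and_zero_point(-6.0, 6.0)   # z_pre = 128
s_post, z_post = scale_and_zero_point(0.0, 6.0)  # z_post = 0

# After ReLU, quantized codes below the zero point are never produced:
print('codes used when estimating before relu:', 256 - z_pre)   # 128 of 256
print('codes used when estimating after relu: ', 256 - z_post)  # 256 of 256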

Quantization aware training in Tensorflow

You can train your quantized model either by restoring a previously trained floating-point model or from scratch. In either case, you have to create a quantization training graph first.

tf.contrib.quantize.create_training_graph(quant_delay=DELAY_STEP)

DELAY_STEP is the number of steps of normal floating-point training to run first; after DELAY_STEP steps, the quantization aware part of the training kicks in.
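For context, here is a rough single-GPU sketch of where the call sits in the training setup (load_batch_images, net.inference and compute_loss are the same placeholder names used in the multi-GPU snippet below):

images, labels = load_batch_images()
logits, out_data = net.inference(images, num_classes=LABEL_NUM)
# Rewrite the forward graph with fake-quant ops before any gradient ops are added
tf.contrib.quantize.create_training_graph(quant_delay=DELAY_STEP)
loss = compute_loss(labels, logits)
train_op = optimizer.minimize(loss)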

If you train your network on multiple GPUs, you have to create the quantization training graph inside every GPU tower, like the code below:

tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(len(GPU_NUM_ID)):
        with tf.device('/gpu:%d' % GPU_NUM_ID[i]):
            with tf.name_scope('%s_%d' % ('cnn_mg', i)) as scope:
                images, labels = load_batch_images()
                logits, out_data = net.inference(images, reuse=tf.AUTO_REUSE, num_classes=LABEL_NUM)
                # Create the fake-quantization training graph inside every tower
                with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
                    tf.contrib.quantize.create_training_graph(quant_delay=DELAY_STEP)
                loss = compute_loss(labels, logits)
                tf.get_variable_scope().reuse_variables()
                grads = optimizer.compute_gradients(loss)
                tower_grads.append(grads)
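After the loop, the per-tower gradients are combined as usual; a minimal sketch (average_gradients is the standard multi-tower averaging helper from the TensorFlow multi-GPU examples, assumed available here):

avg_grads = average_gradients(tower_grads)
train_op = optimizer.apply_gradients(avg_grads)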

One thing I have to mention is that the quantization aware training process is "fake" training.

Fake training means that during the forward pass, the training graph only simulates integer multiplications with the corresponding floating-point multiplications.

"Corresponding" means that the simulated float weights are the de-quantized values of the corresponding fixed-point integers.

So the forward result may be slightly different from the result of the actual quantized computation.
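To make this concrete, here is a minimal sketch (my own illustration, not TensorFlow's internal code) of what a fake-quant node does to a float tensor in the forward pass, reusing the scale_and_zero_point helper from above:

import numpy as np

def fake_quant(x, x_min, x_max, num_bits=8):
    scale, zero_point = scale_and_zero_point(x_min, x_max, num_bits)
    qmin, qmax = 0, 2 ** num_bits - 1
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)  # quantize to integers
    return scale * (q - zero_point)                            # de-quantize back to float

w = np.array([-0.8, 0.1, 0.5])
print(fake_quant(w, x_min=-1.0, x_max=1.0))  # slightly different from the original w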

Save, Freeze, Convert and Test

Save

When quantization aware training is finished, you have to save the trained quantized model.

To save it, first create a quantized evaluation graph with the following code:

g = tf.get_default_graph()
tf.contrib.quantize.create_eval_graph(input_graph=g)

Then just get the graph and save it.

with open('./your_quantized_graph.pb', 'w') as f:
    f.write(str(g.as_graph_def()))
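Note that the freeze step below also needs a variables checkpoint, so remember to save one as well; a quick sketch (assuming sess is your training/eval session):

saver = tf.train.Saver()
saver.save(sess, './your_quantized_model.ckpt')  # checkpoint consumed by freeze_graph below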

Freeze

To make your model more compact, you can freeze it. Freezing a model means converting its variables into constants, getting rid of useless operations and fusing redundant ones. To freeze your graph, you can use the standard freeze_graph tool:

bazel build tensorflow/python/tools:freeze_graph && \
bazel-bin/tensorflow/python/tools/freeze_graph \
--input_graph=some_graph_def.pb \
--input_checkpoint=model.ckpt-8361242 \
--output_graph=/tmp/frozen_graph.pb --output_node_names=softmax

Convert

The next step is to convert the frozen graph to a tflite model for deployment.

import tensorflow as tf

path_to_frozen_graphdef_pb = './your_frozen_graph.pb'
input_shapes = {'validate_input/imgs': [1, 320, 320, 3]}
# TF > 1.11
converter = tf.contrib.lite.TFLiteConverter.from_frozen_graph(
    path_to_frozen_graphdef_pb, ['validate_input/imgs'], ['output_node'], input_shapes=input_shapes)
# TF <= 1.11
# converter = tf.contrib.lite.TocoConverter.from_frozen_graph(
#     path_to_frozen_graphdef_pb, ['validate_input/imgs'], ['output_node'], input_shapes=input_shapes)
converter.inference_type = tf.contrib.lite.constants.QUANTIZED_UINT8
converter.quantized_input_stats = {'validate_input/imgs': (0., 1.)}  # (mean, std) of the input
converter.allow_custom_ops = True
converter.default_ranges_stats = (0, 255)  # fallback (min, max) for ops without recorded ranges
converter.post_training_quantize = True
tflite_model = converter.convert()
open("sfnv2.tflite", "wb").write(tflite_model)

Test

Finally, you can test your converted tflite model. The following code runs the quantized model:

interpreter = tf.contrib.lite.Interpreter(model_path="your.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
interpreter.set_tensor(input_details[0]['index'], batch_validate_img)  # uint8 input batch
interpreter.invoke()
score = interpreter.get_tensor(output_details[0]['index'])
score = score[0][0]
zero_point = xxx  # zero point of the output layer
scale = xxx       # scale of the output layer
reversed_score = scale * (score - zero_point)  # de-quantize back to a float score

One thing to mention is that the final score you get is a fixed-point integer value.

You have to convert this fixed-point integer back to the corresponding float value.

To do that, look up the zero point and scale of the corresponding output layer and apply scale * (score - zero_point), as in the last line above.
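In newer TensorFlow versions the interpreter also exposes these parameters directly in the output details, so you don't have to hard-code them; a small sketch (assuming the 'quantization' field is populated for your output tensor):

scale, zero_point = output_details[0]['quantization']  # (scale, zero_point) of the output tensor
reversed_score = scale * (score - zero_point)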
