Overview
You can follow along with the hands-on video tutorial: https://www.bilibili.com/video/BV13T4y1f7Cu/
Official YOLOv5 repository: https://github.com/ultralytics/yolov5/
Official Docker image: docker.io/ultralytics/yolov5:v4.0
MLU SDK version: 1.7.0
MLU PyTorch version: 1.3.0
[The steps below were carried out on YOLOv5 v4.0; v5.0 is similar.]
Downgrading a model from a newer PyTorch to an older PyTorch
Why is this step necessary?
Training is currently done on PyTorch 1.8.x, but MLU PyTorch is still at version 1.3.0.
PyTorch 1.6 and later save models in a zip-based archive format that 1.3.0 does not support; opening such a .pt file directly on 1.3.0 raises an error.
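For illustration, the flag below makes a newer PyTorch (1.6+) write the legacy, non-zip format that 1.3.0 can still read:
import torch

# PyTorch >= 1.6 defaults to a zip-based .pt container; this flag
# falls back to the legacy serialization understood by PyTorch 1.3.0.
torch.save(torch.zeros(3), "legacy.pt", _use_new_zipfile_serialization=False)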
We use the official Docker image to simplify environment setup.
Official Docker image: docker.io/ultralytics/yolov5:v4.0
Setting up the newer (1.8.x) PyTorch environment
#!/bin/bash
set -x
export MY_CONTAINER="hub_yolov5_v4_0"
# Count existing containers with this name.
num=`docker ps -a|grep "$MY_CONTAINER"|wc -l`
echo $num
echo $MY_CONTAINER
if [ 0 -eq $num ];then
#xhost +
# --rm is intentionally not used, so the container persists and the
# "docker start" branch below can reuse it on later runs.
nvidia-docker run -it --gpus=all --ipc=host --name $MY_CONTAINER \
docker.io/ultralytics/yolov5:v4.0 /bin/bash
else
docker start $MY_CONTAINER
docker exec -ti --env COLUMNS=`tput cols` --env LINES=`tput lines` $MY_CONTAINER /bin/bash
fi
This launches the docker.io/ultralytics/yolov5:v4.0 image.
Converting the model to the old format
After entering the container, go to the /usr/src/app directory.
Modify detect.py to add the following line:
torch.save(model.state_dict(), "unzip.pt", _use_new_zipfile_serialization=False)
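A minimal sketch of where this line can go (the exact placement is our assumption; any point after the model is loaded works):
model = attempt_load(weights, map_location=device)  # existing line in detect.py
# Re-save the just-loaded weights in the legacy (non-zip) format:
torch.save(model.state_dict(), "unzip.pt", _use_new_zipfile_serialization=False)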
Download the weights file:
If you run detect.py with the defaults, it automatically downloads the 5.0 release weights. Since we use 4.0 throughout, download the 4.0 weights manually to avoid odd compatibility problems:
https://github.com/ultralytics/yolov5/releases/tag/v4.0
Run
python detect.py --device cpu --weights yolov5s.pt
This produces the unzip.pt file, which MLU PyTorch can load.
At this point, the .pt downgrade is complete.
Adding operators
Modify /workspace/neuware_sdk_ubuntu_prebuild/venv/lib/python3.6/site-packages/torch/nn/modules/activation.py to add the SiLU and Hardswish activation functions:
class Hardswish(Module):  # export-friendly version of nn.Hardswish()
    @staticmethod
    def forward(x):
        # return x * F.hardsigmoid(x)  # for torchscript and CoreML
        return x * F.hardtanh(x + 3, 0., 6.) / 6.  # for torchscript, CoreML and ONNX

class SiLU(Module):  # export-friendly version of nn.SiLU()
    @staticmethod
    def forward(x):
        return x * torch.sigmoid(x)
Modify /workspace/neuware_sdk_ubuntu_prebuild/venv/lib/python3.6/site-packages/torch/nn/modules/__init__.py to register the SiLU and Hardswish activation functions, as sketched below:
1. Add SiLU, Hardswish to the from .activation import line.
2. Add 'SiLU', 'Hardswish' to the __all__ list.
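A minimal sketch of the two edits in __init__.py:
# 1) extend the existing "from .activation import ..." line:
from .activation import SiLU, Hardswish

# 2) extend the existing __all__ list:
__all__ += ['SiLU', 'Hardswish']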
Porting to MLU
First, make sure MLU SDK 1.7.0 is installed. The porting steps are as follows:
1. Environment setup (already prepared on the cloud platform)
Use the image: yellow.hub.cambricon.com/pytorch/pytorch:0.15.0-ubuntu16.04
#!/bin/bash
export MY_CONTAINER="Cambricon-MLU270-v1.7.0-pytorch"
num=`docker ps -a|grep "$MY_CONTAINER"|wc -l`
echo $num
echo $MY_CONTAINER
if [ 0 -eq $num ];then
xhost +
docker run -e DISPLAY=unix$DISPLAY --device /dev/cambricon_dev0 \
    --net=host --pid=host --privileged -it \
    -v /sys/kernel/debug:/sys/kernel/debug \
    -v /tmp/.X11-unix:/tmp/.X11-unix \
    --name $MY_CONTAINER \
    -v $PWD/Cambricon-MLU270/:/home/Cambricon-MLU270 \
-v $PWD/../tar_package/datasets:/home/Cambricon-MLU270/datasets \
-v $PWD/../tar_package/models:/home/Cambricon-MLU270/models \
yellow.hub.cambricon.com/pytorch/pytorch:0.15.0-ubuntu16.04 /bin/bash
else
docker start $MY_CONTAINER
#sudo docker attach $MY_CONTAINER
docker exec -ti $MY_CONTAINER /bin/bash
fi
Customize the -v volume mounts to suit your setup.
2. Download yolov5 v4.0
git clone https://github.com/ultralytics/yolov5.git
cd yolov5
git checkout v4.0
The code is now in the yolov5 directory.
3. Activate MLU PyTorch
source /torch/venv3/pytorch/bin/activate
4. Modify parts of the code
The following changes are needed in the yolov5 directory.
1) Imports
(pytorch) root@localhost:/home/Cambricon-MLU270/test/yolov5s-src# git diff
diff --git a/models/experimental.py b/models/experimental.py
index 2dbbf7f..c83c685 100644
--- a/models/experimental.py
+++ b/models/experimental.py
@@ -1,4 +1,6 @@
# This file contains experimental modules
+import matplotlib
+matplotlib.use('Agg')
import numpy as np
import torch
That is, add the two lines import matplotlib and matplotlib.use('Agg') at the top of models/experimental.py; the Agg backend avoids needing a display inside the container.
2) Modify the model-loading code
def attempt_load(weights, map_location=None):
    from models.yolo import Model
    # Build the network from the yaml, fuse it first so that the parameter
    # names match the fused checkpoint, then load the downgraded weights.
    model = Model('./models/yolov5s.yaml')
    state_dict = torch.load(weights[0], map_location='cpu')
    model.float().fuse().eval()
    model.load_state_dict(state_dict, strict=False)
    return model
yolov5s is used by default here, so ./models/yolov5s.yaml is loaded unconditionally.
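Illustrative direct call (an assumption consistent with detect.py, where --weights is parsed as a list, hence the weights[0] indexing above):
model = attempt_load(['yolov5s-v4.pt'], map_location='cpu')  # the downgraded state-dict checkpoint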
3) Verify
python detect.py --device cpu --weights yolov5s-v4.pt
At this point, the official .pt model runs on MLU PyTorch.
Quantization
1. In detect.py, add the following code above the "Run inference" section:
import torch_mlu
import torch_mlu.core.mlu_quantize as mlu_quantize
import torch_mlu.core.mlu_model as ct
...
global quantized_model
if opt.cfg == 'qua':
    qconfig = {'iteration': 2, 'firstconv': False}
    quantized_model = mlu_quantize.quantize_dynamic_mlu(model, qconfig, dtype='int8', gen_quant=True)

# Run inference
t0 = time.time()
img = torch.zeros((1, 3, imgsz, imgsz), device=device)  # init img
...

# Inference
t1 = time_synchronized()
if opt.cfg == 'cpu':
    pred = model(img, augment=opt.augment)[0]
    print('run cpu')
elif opt.cfg == 'qua':
    pred = quantized_model(img)[0]
    torch.save(quantized_model.state_dict(), 'yolov5s_int8.pt')
    print('run qua')

if __name__ == '__main__':
    ...
    parser.add_argument('--cfg', default='cpu', help='cpu, qua or mlu')
    parser.add_argument('--jit', type=bool, default=False)
    ...
2. Run
python detect.py --weights yolov5s-v4.pt --cfg qua
The torch.save(quantized_model.state_dict(), 'yolov5s_int8.pt') call above saves the quantized weights as yolov5s_int8.pt.
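A quick sanity check (illustrative; the exact key names produced by quantize_dynamic_mlu are an assumption on our side):
import torch

# The quantized checkpoint should load as a plain state dict on CPU and
# contain extra quantization entries alongside the original weights.
state = torch.load('yolov5s_int8.pt', map_location='cpu')
print(len(state), 'entries')
print([k for k in state if 'scale' in k][:5])  # quantization scales, if present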
Layer-by-layer mode
Every layer must run on the MLU. Print-based tracing shows that all layers except the last are already supported; the final Detect layer is handled by a dedicated MLU operator.
models/yolo.py
diff --git a/models/yolo.py b/models/yolo.py
old mode 100644
new mode 100755
index 5dc8b57..fa0e9fa
--- a/models/yolo.py
+++ b/models/yolo.py
@@ -26,6 +26,8 @@ class Detect(nn.Module):
     def __init__(self, nc=80, anchors=(), ch=()):  # detection layer
         super(Detect, self).__init__()
+        self.anchors_list = list(np.array(anchors).flatten())
+        self.num_anchors = len(self.anchors_list)
         self.nc = nc  # number of classes
         self.no = nc + 5  # number of outputs per anchor
         self.nl = len(anchors)  # number of detection layers
@@ -36,10 +38,32 @@
         self.register_buffer('anchor_grid', a.clone().view(self.nl, 1, -1, 1, 1, 2))  # shape(nl,1,na,1,1,2)
         self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)  # output conv
+        # self.tmp_shape = [[1,255,80,64],[1,255,40,32],[1,255,20,16]]
+        self.img_h = 640        # fixed input height expected by the MLU operator
+        self.img_w = 640        # fixed input width
+        self.conf_thres = 0.25  # confidence threshold
+        self.iou_thres = 0.45   # NMS IoU threshold
+        self.maxBoxNum = 1024   # maximum number of boxes the operator returns
+
     def forward(self, x):
         # x = x.copy()  # for profiling
         z = []  # inference output
+        output = []
+
         self.training |= self.export
+
+        if x[0].device.type == 'mlu':
+            for i in range(self.nl):
+                x[i] = self.m[i](x[i])  # conv
+                y = x[i].sigmoid()
+                # print('y.shape: ', y.shape)
+                output.append(y)
+
+            # anchors_list is the flattened anchor set, e.g.
+            # [10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326]
+            detect_out = torch.ops.torch_mlu.yolov5_detection_output(
+                output[0], output[1], output[2],
+                self.anchors_list, self.nc, self.num_anchors,
+                self.img_h, self.img_w, self.conf_thres, self.iou_thres, self.maxBoxNum)
+            return detect_out
         for i in range(self.nl):
             x[i] = self.m[i](x[i])  # conv
             bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
As shown above, the Detect class now calls the MLU-supported operator torch.ops.torch_mlu.yolov5_detection_output instead of the CPU decode path. Note that the operator takes the image height and width as fixed inputs, 640 x 640 here.
detect.py also needs a layer-by-layer execution path. Add a get_boxes function that parses the result of torch.ops.torch_mlu.yolov5_detection_output and draws the boxes, so the CPU box post-processing (NMS) is no longer needed:
+import numpy as np
+
+def get_boxes(prediction, batch_size=1, img_size=640):
+    """
+    Parse the flat output of torch.ops.torch_mlu.yolov5_detection_output.
+    Layout: element 0 holds the number of boxes; box records start at
+    offset 64, 7 floats per box: (batch_idx, class, score, x1, y1, x2, y2).
+    Returns detections with shape: (x1, y1, x2, y2, object_conf, class_pred)
+    """
+    reshape_value = torch.reshape(prediction, (-1, 1))
+
+    num_boxes_final = reshape_value[0].item()
+    print('num_boxes_final: ', num_boxes_final)
+    all_list = [[] for _ in range(batch_size)]
+    for i in range(int(num_boxes_final)):
+        batch_idx = int(reshape_value[64 + i * 7 + 0].item())
+        if batch_idx >= 0 and batch_idx < batch_size:
+            bl = reshape_value[64 + i * 7 + 3].item()  # x1
+            br = reshape_value[64 + i * 7 + 4].item()  # y1
+            bt = reshape_value[64 + i * 7 + 5].item()  # x2
+            bb = reshape_value[64 + i * 7 + 6].item()  # y2
+
+            if bt - bl > 0 and bb - br > 0:  # keep boxes with positive width and height
+                all_list[batch_idx].append(bl)
+                all_list[batch_idx].append(br)
+                all_list[batch_idx].append(bt)
+                all_list[batch_idx].append(bb)
+                all_list[batch_idx].append(reshape_value[64 + i * 7 + 2].item())  # score
+                all_list[batch_idx].append(reshape_value[64 + i * 7 + 1].item())  # class
+
+    output = [np.array(all_list[i]).reshape(-1, 6) for i in range(batch_size)]
+    # outputs = [torch.FloatTensor(all_list[i]).reshape(-1, 6) for i in range(batch_size)]
+    return output
+    # Unused alternative: emit a score-sorted list of dicts
+    # jdict = []
+    # for si, pred in enumerate(output):
+    #     box = pred[:, :4]  # x1, y1, x2, y2
+    #     for di, d in enumerate(pred):
+    #         box_temp = []
+    #         box_temp.append(np.round(box[di][0], 3).item())
+    #         box_temp.append(np.round(box[di][1], 3).item())
+    #         box_temp.append(np.round(box[di][2], 3).item())
+    #         box_temp.append(np.round(box[di][3], 3).item())
+    #         jdict.append({'bbox': box_temp, 'score': (np.round(d[5], 5)).item()})
+    # sorted_jdict = sorted(jdict, key=lambda x: x['score'], reverse=True)
+    # return sorted_jdict
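A quick illustrative check of get_boxes against a synthetic tensor laid out as described above (the offsets are inferred from the parsing code itself, not from an official spec):
import torch
fake = torch.zeros(64 + 7)
fake[0] = 1  # one detection in the buffer
# One record: (batch_idx, class, score, x1, y1, x2, y2)
fake[64:64 + 7] = torch.tensor([0., 0., 0.9, 10., 20., 110., 220.])
print(get_boxes(fake, batch_size=1))  # -> [array([[10., 20., 110., 220., 0.9, 0.]])]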
Add the mlu configuration path
def detect(save_img=False):
    source, weights, view_img, save_txt, imgsz = opt.source, opt.weights, opt.view_img, opt.save_txt, opt.img_size
@@ -55,6 +102,27 @@ def detect(save_img=False):
    names = model.module.names if hasattr(model, 'module') else model.names
    colors = [[random.randint(0, 255) for _ in range(3)] for _ in names]
+    global quantized_model
+    global quantized_net
+
+    if opt.cfg == 'qua':
+        qconfig = {'iteration': 2, 'firstconv': False}
+        quantized_model = mlu_quantize.quantize_dynamic_mlu(model, qconfig, dtype='int8', gen_quant=True)
+
+    elif opt.cfg == 'mlu':
+        from models.yolo import Model
+
+        # Rebuild the network, fuse it, then load the int8 checkpoint
+        # produced by the quantization step.
+        model = Model('./models/yolov5s.yaml').to(torch.device('cpu'))
+        model.float().fuse().eval()
+
+        quantized_net = torch_mlu.core.mlu_quantize.quantize_dynamic_mlu(model)
+
+        state_dict = torch.load("./yolov5s_int8.pt")
+        quantized_net.load_state_dict(state_dict, strict=False)
+
+        quantized_net.eval()
+        quantized_net.to(ct.mlu_device())
+
# Run inference
Add the mlu inference path
    img = torch.zeros((1, 3, imgsz, imgsz), device=device)  # init img
@@ -68,8 +136,37 @@
         # Inference
         t1 = time_synchronized()
-        pred = model(img, augment=opt.augment)[0]
+
+        if opt.cfg == 'qua':
+            pred = quantized_model(img)[0]
+            torch.save(quantized_model.state_dict(), 'yolov5s_int8.pt')
+            print('run qua')
+
+        elif opt.cfg == 'mlu':
+            # Cast to half precision and move to the MLU device.
+            img = img.type(torch.HalfTensor).to(ct.mlu_device())
+            pred = quantized_net(img)[0]
+
+            pred = pred.data.cpu().type(torch.FloatTensor)
+            box_result = get_boxes(pred)
+            print("im0s.shape:", im0s.shape)
+            print(box_result)
+            res = box_result[0].tolist()
+
+            # Dump the boxes and draw them directly; no CPU NMS needed.
+            # (requires "import os" at the top of detect.py)
+            with open("yolov5s_mlu_output.txt", "w+") as f:
+                for pt in sorted(res, key=lambda x: (x[0], x[1])):
+                    f.write("{}\n{}\n{}\n{}\n".format(pt[0], pt[1], pt[2], pt[3]))
+                    cv2.rectangle(im0s, (int(pt[0]), int(pt[1])), (int(pt[2]), int(pt[3])), (255, 0, 0), 2)
+            cv2.imwrite("mlu_out_{}.jpg".format(os.path.basename(path).split('.')[0]), im0s)
+            print('run mlu')
+        elif opt.cfg == 'cpu':
+            pred = model(img, augment=opt.augment)[0]
+            print('run cpu')
+
+        if opt.cfg != 'cpu':
+            continue
         # Apply NMS
         pred = non_max_suppression(pred, opt.conf_thres, opt.iou_thres, classes=opt.classes, agnostic=opt.agnostic_nms)
         t2 = time_synchronized()
Run
python detect.py --device cpu --weights yolov5s-v4.pt --cfg mlu
Result: the detections drawn on the test image (the code above saves them as mlu_out_<name>.jpg).
Fusion
Add the opt.jit branch shown below:
+        elif opt.cfg == 'mlu':
+            from models.yolo import Model
+
+            model = Model('./models/yolov5s.yaml').to(torch.device('cpu'))
+            model.float().fuse().eval()
+
+            quantized_net = torch_mlu.core.mlu_quantize.quantize_dynamic_mlu(model)
+
+            state_dict = torch.load("./yolov5s_int8.pt")
+            quantized_net.load_state_dict(state_dict, strict=False)
+
+            quantized_net.eval()
+            quantized_net.to(ct.mlu_device())
+
+            if opt.jit:
+                print("### jit")
+                ct.save_as_cambricon('yolov5s_int8_1_4')  # name of the offline model file
+                torch.set_grad_enabled(False)
+                ct.set_core_number(4)  # use 4 MLU cores
+                trace_input = torch.randn(1, 3, 640, 640, dtype=torch.float)
+                input_mlu_data = trace_input.type(torch.HalfTensor).to(ct.mlu_device())
+                # The original snippet is truncated here; the line below is the
+                # usual completion of this pattern (reconstructed, not verbatim).
+                quantized_net = torch.jit.trace(quantized_net, input_mlu_data, check_trace=False)
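After tracing, the usual flow (our assumption, implied by save_as_cambricon above but not shown in the truncated original) is to run the traced network once, which triggers generation of the offline model:
pred = quantized_net(input_mlu_data)  # one forward pass writes yolov5s_int8_1_4.cambricon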
Test
python detect.py --weights yolov5s-v4.pt --cfg mlu --jit True
This finally produces the yolov5s_int8_1_4.cambricon offline model.
Other