YOLOv5 Porting Tutorial (with Hands-on Video)



Notes

You can follow along with the hands-on video tutorial: https://www.bilibili.com/video/BV13T4y1f7Cu/

 

Official YOLOv5 documentation: https://github.com/ultralytics/yolov5/

Official Docker image: docker.io/ultralytics/yolov5:v4.0

https://hub.docker.com/

 

MLU SDK version: 1.7.0

MLU PyTorch version: 1.3.0

 

[The steps below were carried out on YOLOv5 v4.0; v5.0 is similar.]

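Downgrading the checkpoint from a newer PyTorch to an older one

Why is this step necessary?

Training is currently done with PyTorch 1.8.x, but the MLU build of PyTorch is still version 1.3.0.

Newer PyTorch releases save models in a zip-based format that 1.3.0 does not support, so opening a .pt file saved by a newer release directly in 1.3.0 raises an error.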
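To see which format a given checkpoint uses before copying it to the MLU environment, one option is to test whether the file is a zip archive. The sketch below uses only the Python standard library. The helper name is_zip_checkpoint is hypothetical, and it assumes that new-style checkpoints are ordinary zip archives (the default since PyTorch 1.6) while legacy ones are not.

```python
import zipfile


def is_zip_checkpoint(path):
    """Return True if the file at `path` is a zip archive.

    Assumption: a zip archive means the checkpoint was written with the
    newer zip-based serialization and will not load in PyTorch 1.3.0.
    """
    return zipfile.is_zipfile(path)


# Example (hypothetical file names):
#   is_zip_checkpoint("yolov5s.pt")  -> True means it must be re-saved first
#   is_zip_checkpoint("unzip.pt")    -> False is expected after the downgrade
```

A True result for a file means it still needs to be re-saved in the legacy format before MLU PyTorch 1.3.0 can open it.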

Use the official Docker image to simplify environment setup.

Official Docker image: docker.io/ultralytics/yolov5:v4.0

 

Setting up the newer (1.8) PyTorch environment

#!/bin/bash
set -x
export MY_CONTAINER="hub_yolov5_v4_0"
# Count existing containers (running or stopped) with this name
num=`docker ps -a|grep "$MY_CONTAINER"|wc -l`
echo $num
echo $MY_CONTAINER
if [ 0 -eq $num ];then
#xhost +
# No such container yet: create one from the official image
nvidia-docker run -it --rm --gpus=all --ipc=host --name $MY_CONTAINER  \
docker.io/ultralytics/yolov5:v4.0 /bin/bash
else
# The container already exists: start it and open a shell inside it
docker start $MY_CONTAINER
docker exec -ti --env COLUMNS=`tput cols` --env LINES=`tput lines` $MY_CONTAINER /bin/bash
fi

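This starts the docker.io/ultralytics/yolov5:v4.0 image.

 

Downgrading the checkpoint format

After entering the container, change to the /usr/src/app directory.

Edit detect.py and add the following line after the model has been loaded: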

torch.save(model.state_dict(), "unzip.pt", _use_new_zipfile_serialization=False)

 

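Download the weights file:

If detect.py is run with its defaults, it downloads the v5.0 weights. This tutorial uses v4.0 throughout, so download the v4.0 weights manually to avoid unexpected problems.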

https://github.com/ultralytics/yolov5/releases/tag/v4.0

 

Run

python detect.py --device cpu --weights yolov5.pt

This generates unzip.pt, which can be used with MLU PyTorch.

The checkpoint downgrade is now complete.

 

Adding operators

Edit /workspace/neuware_sdk_ubuntu_prebuild/venv/lib/python3.6/site-packages/torch/nn/modules/activation.py and add the Hardswish and SiLU activation classes:

class Hardswish(Module): # export-friendly version of nn.Hardswish()

   @staticmethod

   def forward(x):

       # return x * F.hardsigmoid(x) # for torchscript and CoreML

       return x * F.hardtanh(x + 3, 0., 6.) / 6. # for torchscript, CoreML and ONNX

 

class SiLU(Module):  # export-friendly version of nn.SiLU()

   @staticmethod

   def forward(x):

       return x * torch.sigmoid(x)
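The two formulas can be checked numerically without PyTorch. The following is a minimal scalar sketch in plain Python; the function names hardtanh, hardswish and silu are local stand-ins for illustration, not the torch implementations.

```python
import math


def hardtanh(x, min_val, max_val):
    """Clamp x to the range [min_val, max_val]."""
    return max(min_val, min(max_val, x))


def hardswish(x):
    """Same expression as the Hardswish class above: x * hardtanh(x + 3, 0, 6) / 6."""
    return x * hardtanh(x + 3.0, 0.0, 6.0) / 6.0


def silu(x):
    """Same expression as the SiLU class above: x * sigmoid(x)."""
    return x * (1.0 / (1.0 + math.exp(-x)))


for value in (-4.0, -1.0, 0.0, 1.0, 3.0):
    print(value, hardswish(value), silu(value))
```

Hardswish is exactly zero for inputs at or below -3 and equals the input for inputs at or above 3, which is why it can be built from hardtanh alone.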

 

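Edit /workspace/neuware_sdk_ubuntu_prebuild/venv/lib/python3.6/site-packages/torch/nn/modules/__init__.py to register the SiLU and Hardswish activation functions:

1. Add SiLU and Hardswish to the from .activation import statement.

2. Add 'SiLU' and 'Hardswish' to __all__.

Porting to MLU

First confirm that MLU SDK 1.7.0 is installed. The porting steps are as follows:

 

1. Set up the environment (already done on the cloud platform)

Image used: yellow.hub.cambricon.com/pytorch/pytorch:0.15.0-ubuntu16.04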

#!/bin/bash

export MY_CONTAINER="Cambricon-MLU270-v1.7.0-pytorch"
# Count existing containers (running or stopped) with this name
num=`docker ps -a|grep "$MY_CONTAINER"|wc -l`
echo $num
echo $MY_CONTAINER
if [ 0 -eq $num ];then
xhost +
# No such container yet: create one with the MLU device and data directories mounted
docker run -e DISPLAY=unix$DISPLAY --device /dev/cambricon_dev0 --net=host --pid=host -v /sys/kernel/debug:/sys/kernel/debug -v /tmp/.X11-unix:/tmp/.X11-unix -it --privileged --name $MY_CONTAINER -v $PWD/Cambricon-MLU270/:/home/Cambricon-MLU270 \
-v $PWD/../tar_package/datasets:/home/Cambricon-MLU270/datasets \
-v $PWD/../tar_package/models:/home/Cambricon-MLU270/models \
yellow.hub.cambricon.com/pytorch/pytorch:0.15.0-ubuntu16.04 /bin/bash
else
# The container already exists: start it and open a shell inside it
docker start $MY_CONTAINER
#sudo docker attach $MY_CONTAINER
docker exec -ti $MY_CONTAINER /bin/bash
fi

 

Adjust the -v mounts to match your own directories.

 

2. Download YOLOv5 v4.0

git clone https://github.com/ultralytics/yolov5.git
cd yolov5
git checkout v4.0

The working directory is yolov5.

 

3. Activate MLU PyTorch

source /torch/venv3/pytorch/bin/activate

 

4. Modify the code

Some code in the yolov5 directory needs to be changed.

1) Import the library

(pytorch) root@localhost:/home/Cambricon-MLU270/test/yolov5s-src# git diff

diff --git a/models/experimental.py b/models/experimental.py

index 2dbbf7f..c83c685 100644

--- a/models/experimental.py

+++ b/models/experimental.py

@@ -1,4 +1,6 @@

 # This file contains experimental modules

+import matplotlib

+matplotlib.use('Agg')

 

 import numpy as np

 import torch

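At the top of models/experimental.py, add the two lines import matplotlib and matplotlib.use('Agg'). Agg is a non-interactive backend, so plotting works without a display.

2) Modify the model-loading code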

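def attempt_load(weights, map_location=None):
    from models.yolo import Model

    # Build the network structure from the yolov5s config
    model = Model('./models/yolov5s.yaml')

    # The downgraded checkpoint is a plain state_dict; always load it on the CPU
    state_dict = torch.load(weights[0], map_location='cpu')
    # Fuse layers and switch to float, eval mode before loading the weights
    model.float().fuse().eval()

    # strict=False: keys that do not match are skipped instead of raising an error
    model.load_state_dict(state_dict, strict=False)
    # Re-apply after loading to make sure the model is in float, eval mode
    model.float().fuse().eval()

    return model

yolov5s is used by default here, so ./models/yolov5s.yaml is hard-coded.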
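Because load_state_dict is called with strict=False, keys that do not match are skipped without an error, so a wrong yaml or checkpoint can go unnoticed. One way to make mismatches visible is to compare the two key sets first. The sketch below works on plain key lists so that it runs without PyTorch; the helper name diff_state_dict_keys and the sample key names are hypothetical.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Compare parameter names of a model and a checkpoint.

    Returns (missing, unexpected):
      missing    - names the model expects but the checkpoint lacks
      unexpected - names in the checkpoint that the model does not have
    """
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Illustrative names only: an unfused model compared with fused weights
missing, unexpected = diff_state_dict_keys(
    ["model.0.conv.weight", "model.0.bn.weight"],
    ["model.0.conv.weight", "model.0.conv.bias"],
)
print("missing:", missing)
print("unexpected:", unexpected)
```

Inside attempt_load the same helper could be called as diff_state_dict_keys(model.state_dict().keys(), state_dict.keys()); two empty lists would indicate that every weight was matched.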

 

3) Verify

python detect.py --device cpu --weights yolov5s-v4.pt

The official .pt model can now run under MLU PyTorch.

 

 

Quantization

1. In detect.py, add the following code above the Run inference section

import torch_mlu
import torch_mlu.core.mlu_quantize as mlu_quantize
import torch_mlu.core.mlu_model as ct

...

    global quantized_model
    if opt.cfg == 'qua':
        # Wrap the CPU model so that running it collects int8 quantization parameters
        qconfig = {'iteration':2,'firstconv':False}
        quantized_model = mlu_quantize.quantize_dynamic_mlu(model, qconfig, dtype='int8', gen_quant=True)

    # Run inference
    t0 = time.time()
    img = torch.zeros((1, 3, imgsz, imgsz), device=device)  # init img
    ....
    # Inference
    t1 = time_synchronized()
    if opt.cfg == 'cpu':
        pred = model(img, augment=opt.augment)[0]
        print('run cpu')

    elif opt.cfg == 'qua':
        pred = quantized_model(img)[0]
        # Save the quantized weights for the later MLU steps
        torch.save(quantized_model.state_dict(), 'yolov5s_int8.pt')
        print('run qua')

if __name__ == '__main__':
    ...
    parser.add_argument('--cfg', default='cpu', help='cpu, qua or mlu')
    # Note: with type=bool, any non-empty string (including "False") parses as True
    parser.add_argument('--jit', type=bool,default=False)
...
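The --jit option above uses type=bool, which has a well-known argparse pitfall: bool('False') is True, so --jit False would still enable the option. A minimal sketch of a safer parser follows. The names str2bool and build_parser are hypothetical helpers, and restricting --cfg with choices is an optional addition rather than part of the original script.

```python
import argparse


def str2bool(value):
    """Convert common textual booleans to bool for argparse."""
    if isinstance(value, bool):
        return value
    text = value.strip().lower()
    if text in ("true", "1", "yes", "y"):
        return True
    if text in ("false", "0", "no", "n"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def build_parser():
    parser = argparse.ArgumentParser()
    # The three run modes used in this tutorial
    parser.add_argument('--cfg', default='cpu', choices=['cpu', 'qua', 'mlu'],
                        help='cpu, qua or mlu')
    parser.add_argument('--jit', type=str2bool, default=False)
    return parser


args = build_parser().parse_args(['--cfg', 'mlu', '--jit', 'False'])
print(args.cfg, args.jit)
```

With this parser, the commands used in this tutorial (for example --jit True) keep working, and --jit False behaves as written.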

 

2. Run

python detect.py --weights yolov5s-v4.pt --cfg qua

torch.save(quantized_model.state_dict(), 'yolov5s_int8.pt')

This line saves the quantized .pt file.
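For readers new to int8 quantization, the general idea of deriving a scale from a few calibration passes can be sketched in plain Python. This is a generic symmetric scheme for illustration only; it is an assumption and not a description of how mlu_quantize computes or stores its parameters. All function names are hypothetical.

```python
def calibrate_scale(batches, num_bits=8):
    """Derive one scale from the largest absolute value seen during calibration."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for batch in batches for v in batch)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Map floats to integers in [-qmax, qmax], clamping values out of range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(quantized, scale):
    """Map integers back to approximate float values."""
    return [q * scale for q in quantized]


# Two calibration passes, loosely analogous to 'iteration': 2 in qconfig
scale = calibrate_scale([[0.5, -1.27], [0.3, 1.0]])
q = quantize([1.27, -1.27, 0.0, 2.0], scale)
print(scale, q, dequantize(q, scale))
```

The value 2.0 lies outside the calibrated range and is clamped, which illustrates why calibration images should be representative of real inputs.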

 

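Layer-by-layer execution

Every layer has to run on the MLU. Printing the layers shows that all layers except the last are already supported; the last layer (Detect) is handled by a dedicated MLU operator.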

 

models/yolo.py

 

diff --git a/models/yolo.py b/models/yolo.py

old mode 100644

new mode 100755

index 5dc8b57..fa0e9fa

--- a/models/yolo.py

+++ b/models/yolo.py

@@ -26,6 +26,8 @@ class Detect(nn.Module):

 

     def __init__(self, nc=80, anchors=(), ch=()):  # detection layer

         super(Detect, self).__init__()

+        self.anchors_list = list(np.array(anchors).flatten())

+        self.num_anchors = len(self.anchors_list)

         self.nc = nc  # number of classes

         self.no = nc + 5  # number of outputs per anchor

         self.nl = len(anchors)  # number of detection layers

@@ -36,10 +38,32 @@ class Detect(nn.Module):

         self.register_buffer('anchor_grid', a.clone().view(self.nl, 1, -1, 1, 1, 2))  # shape(nl,1,na,1,1,2)

         self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)  # output conv

 

+        #self.tmp_shape=[[1,255,80,64],[1,255,40,32],[1,255,20,16]]

+        self.img_h = 640

+        self.img_w = 640

+        self.conf_thres = 0.25

+        self.iou_thres = 0.45

+        self.maxBoxNum = 1024

+

     def forward(self, x):

         # x = x.copy()  # for profiling

         z = []  # inference output

+        output = []

+

         self.training |= self.export

+       

+        if x[0].device.type == 'mlu':

+            for i in range(self.nl):

+                x[i] = self.m[i](x[i])  # conv

+                y = x[i].sigmoid()

+                # print('y.shape: ',y.shape)

+                output.append(y)

+

+            detect_out = torch.ops.torch_mlu.yolov5_detection_output(output[0], output[1], output[2],

+                              self.anchors_list,self.nc, self.num_anchors,

+                             self.img_h, self.img_w, self.conf_thres, self.iou_thres, self.maxBoxNum)

+                            #  [10, 13, 16, 30, 33, 23,30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326]

+            return detect_out

         for i in range(self.nl):

             x[i] = self.m[i](x[i])  # conv

             bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)

 

As the diff shows, the Detect class now calls the MLU-supported operator torch.ops.torch_mlu.yolov5_detection_output when the input is on an MLU device.

The operator takes the image height and width as inputs; here both are fixed at 640.
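The anchors_list and num_anchors arguments come from flattening the anchors passed to Detect. The sketch below reproduces that step in plain Python, using the anchor values from the comment in the diff; flatten_anchors is a hypothetical stand-in for list(np.array(anchors).flatten()).

```python
# Anchor values taken from the comment in the diff above (three detection layers)
anchors = [
    [10, 13, 16, 30, 33, 23],
    [30, 61, 62, 45, 59, 119],
    [116, 90, 156, 198, 373, 326],
]


def flatten_anchors(rows):
    """Flatten the per-layer anchor lists into one flat list, row by row."""
    return [value for row in rows for value in row]


anchors_list = flatten_anchors(anchors)
num_anchors = len(anchors_list)
print(anchors_list)
print(num_anchors)
```

Note that num_anchors, as computed in the diff, counts individual numbers (18 here), not (width, height) pairs.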

 

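detect.py also needs a layer-by-layer run mode.

Add a get_boxes function that reads the output of torch.ops.torch_mlu.yolov5_detection_output and draws the boxes, so the CPU-side box post-processing is no longer needed.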

 

+import numpy as np

+def get_boxes(prediction, batch_size=1, img_size=640):

+    """

+    Returns detections with shape:

+        (x1, y1, x2, y2, object_conf, class_score, class_pred)

+    """

+    reshape_value = torch.reshape(prediction, (-1, 1))

+

+    num_boxes_final = reshape_value[0].item()

+    print('num_boxes_final: ',num_boxes_final)

+    all_list = [[] for _ in range(batch_size)]

+    for i in range(int(num_boxes_final)):

+        batch_idx = int(reshape_value[64 + i * 7 + 0].item())

+        if batch_idx >= 0 and batch_idx < batch_size:

+            bl = reshape_value[64 + i * 7 + 3].item()

+            br = reshape_value[64 + i * 7 + 4].item()

+            bt = reshape_value[64 + i * 7 + 5].item()

+            bb = reshape_value[64 + i * 7 + 6].item()

+

+            if bt - bl > 0 and bb -br > 0:

+                all_list[batch_idx].append(bl)

+                all_list[batch_idx].append(br)

+                all_list[batch_idx].append(bt)

+                all_list[batch_idx].append(bb)

+                all_list[batch_idx].append(reshape_value[64 + i * 7 + 2].item())

+                # all_list[batch_idx].append(reshape_value[64 + i * 7 + 2].item())

+                all_list[batch_idx].append(reshape_value[64 + i * 7 + 1].item())

+

+    output = [np.array(all_list[i]).reshape(-1, 6) for i in range(batch_size)]

+    # outputs = [torch.FloatTensor(all_list[i]).reshape(-1, 6) for i in range(batch_size)]

+    return output

+    # jdict = []

+    # for si, pred in enumerate(output):

+    #     box = pred[:, :4]  #x1, y1, x2, y2

+    #     for di, d in enumerate(pred):

+    #         box_temp = []

+    #         box_temp.append(np.round(box[di][0], 3).item())

+    #         box_temp.append(np.round(box[di][1], 3).item())

+    #         box_temp.append(np.round(box[di][2], 3).item())

+    #         box_temp.append(np.round(box[di][3], 3).item())

+    #         jdict.append({'bbox': box_temp, 'score': (np.round(d[5], 5)).item()})

+    # sorted_jdict = sorted(jdict, key=lambda x:x['score'], reverse=True)

+    # return sorted_jdict
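The buffer layout that get_boxes relies on can be illustrated with a small pure-Python sketch: element 0 holds the box count, box records start at offset 64, and each record has 7 values. The field meanings (batch index, class, score, then x1, y1, x2, y2) are inferred from the indexing in get_boxes and are an assumption, not taken from the operator's documentation. The function name parse_detection_buffer and the sample values are hypothetical.

```python
HEADER_SIZE = 64   # offset of the first box record, as used by get_boxes
RECORD_SIZE = 7    # values per box record


def parse_detection_buffer(flat, batch_size=1):
    """Return, per batch item, a list of [x1, y1, x2, y2, score, class_id]."""
    num_boxes = int(flat[0])
    results = [[] for _ in range(batch_size)]
    for i in range(num_boxes):
        base = HEADER_SIZE + i * RECORD_SIZE
        batch_idx = int(flat[base])
        if not 0 <= batch_idx < batch_size:
            continue
        class_id, score = flat[base + 1], flat[base + 2]
        x1, y1, x2, y2 = flat[base + 3:base + 7]
        # Keep only boxes with positive width and height
        if x2 - x1 > 0 and y2 - y1 > 0:
            results[batch_idx].append([x1, y1, x2, y2, score, class_id])
    return results


# Synthetic buffer with two records; the second one has zero width
flat = [0.0] * (HEADER_SIZE + 2 * RECORD_SIZE)
flat[0] = 2.0
flat[64:71] = [0.0, 0.0, 0.9, 10.0, 20.0, 110.0, 220.0]
flat[71:78] = [0.0, 1.0, 0.8, 50.0, 50.0, 50.0, 80.0]
print(parse_detection_buffer(flat))
```

Only the first record survives, because the second has zero width and is filtered out in the same way get_boxes drops degenerate boxes.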

 

Add the mlu configuration mode

 def detect(save_img=False):

     source, weights, view_img, save_txt, imgsz = opt.source, opt.weights, opt.view_img, opt.save_txt, opt.img_size

@@ -55,6 +102,27 @@ def detect(save_img=False):

     names = model.module.names if hasattr(model, 'module') else model.names

     colors = [[random.randint(0, 255) for _ in range(3)] for _ in names]

 

+    global quantized_model

+    global quantized_net

+

+    if opt.cfg == 'qua':

+        qconfig = {'iteration':2,'firstconv':False}

+        quantized_model = mlu_quantize.quantize_dynamic_mlu(model, qconfig, dtype='int8', gen_quant=True)

+   

+    elif opt.cfg == 'mlu':

+        from models.yolo import Model

+

+        model = Model('./models/yolov5s.yaml').to(torch.device('cpu'))

+        model.float().fuse().eval()

+

+        quantized_net = torch_mlu.core.mlu_quantize.quantize_dynamic_mlu(model)

+

+        state_dict = torch.load("./yolov5s_int8.pt")

+        quantized_net.load_state_dict(state_dict, strict=False)

+

+        quantized_net.eval()

+        quantized_net.to(ct.mlu_device())

+       

     # Run inference

 

Add the mlu inference path

  img = torch.zeros((1, 3, imgsz, imgsz), device=device)  # init img

@@ -68,8 +136,37 @@ def detect(save_img=False):

 

         # Inference

         t1 = time_synchronized()

-        pred = model(img, augment=opt.augment)[0]

+       

+

+        if opt.cfg == 'qua':

+            pred = quantized_model(img)[0]

+            torch.save(quantized_model.state_dict(), 'yolov5s_int8.pt')

+            print('run qua')

+       

+        elif opt.cfg == 'mlu':

+            img = img.type(torch.HalfTensor).to(ct.mlu_device())

+            img = img.to(ct.mlu_device())

+            pred = quantized_net(img)[0]

+           

+            pred=pred.data.cpu().type(torch.FloatTensor)

+            box_result = get_boxes(pred)

+            print("im0s.shape:",im0s.shape)

+            print(box_result)                 

+            res = box_result[0].tolist()

+           

+            with open("yolov5s_mlu_output.txt","w+") as f:

+                for pt in sorted(res, key=lambda x:(x[0],x[1])):

+                    f.write("{}\n{}\n{}\n{}\n".format(pt[0],pt[1],pt[2],pt[3]))                 

+                    cv2.rectangle(im0s, (int(pt[0]), int(pt[1])), (int(pt[2]), int(pt[3])), (255,0,0), 2)               

+                cv2.imwrite("mlu_out_{}.jpg".format(os.path.basename(path).split('.')[0]), im0s)   

+            print('run mlu')

 

+        elif opt.cfg == 'cpu':

+            pred = model(img, augment=opt.augment)[0]

+            print('run cpu')

+       

+        if opt.cfg != 'cpu':

+            continue

         # Apply NMS

         pred = non_max_suppression(pred, opt.conf_thres, opt.iou_thres, classes=opt.classes, agnostic=opt.agnostic_nms)

         t2 = time_synchronized()

 

Run

python detect.py --device cpu --weights yolov5s-v4.pt --cfg mlu

 

Result on the test image (written to mlu_out_<image name>.jpg by the code above):
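Besides the image, the mlu branch writes the box coordinates to yolov5s_mlu_output.txt, one value per line and four lines per box. Because the file is opened in "w+" mode for every image, it holds only the boxes of the last image processed. A minimal sketch for reading it back, for example to compare against CPU results, could look like this; read_mlu_boxes is a hypothetical helper.

```python
def read_mlu_boxes(path):
    """Read boxes written as four consecutive lines: x1, y1, x2, y2."""
    with open(path) as f:
        values = [float(line) for line in f if line.strip()]
    if len(values) % 4 != 0:
        raise ValueError("expected a multiple of 4 values, got %d" % len(values))
    return [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]


# Example (file name as written by detect.py in mlu mode):
#   boxes = read_mlu_boxes("yolov5s_mlu_output.txt")
```

Each returned tuple is (x1, y1, x2, y2) in the same order in which detect.py wrote the values.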

 

Fusion

Add the following code; the new part is the opt.jit branch.

+    elif opt.cfg == 'mlu':

+        from models.yolo import Model

+

+        model = Model('./models/yolov5s.yaml').to(torch.device('cpu'))

+        model.float().fuse().eval()

+

+        quantized_net = torch_mlu.core.mlu_quantize.quantize_dynamic_mlu(model)

+

+        state_dict = torch.load("./yolov5s_int8.pt")

+        quantized_net.load_state_dict(state_dict, strict=False)

+

+        quantized_net.eval()

+        quantized_net.to(ct.mlu_device())

+

+        if opt.jit:

+            print("### jit")

+            ct.save_as_cambricon('yolov5s_int8_1_4')

+            torch.set_grad_enabled(False)

+            ct.set_core_number(4)

+            trace_input = torch.randn(1, 3, 640, 640, dtype=torch.float)

+            input_mlu_data = trace_input.type(torch.HalfTensor).to(ct.mlu_device())

+            quantized_net = torch.jit.trace(quan

 

Test

python detect.py  --weights yolov5s-v4.pt --cfg mlu --jit True

 

This produces the offline model yolov5s_int8_1_4.cambricon.

 

 
