TorchVision Transforms API: Object Detection, Instance/Semantic Segmentation, and Video Tasks

Updated: 2022-11-09 15:37:09  Author: 神經星星

This article introduces the major upgrade to the TorchVision Transforms API, which now supports object detection, instance/semantic segmentation, and video tasks, and walks through examples of its use.

Overview

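The TorchVision Transforms API has been extended and now supports object detection, instance and semantic segmentation, and video tasks. The new API is still in the testing stage, and developers are welcome to try it out.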

This article was first published on the WeChat official account PyTorch 開發者社區 (PyTorch Developer Community).

TorchVision has extended its Transforms API as follows:

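Beyond image classification, the transforms can now be used for object detection, instance and semantic segmentation, and video classification.
SoTA data augmentations such as MixUp, CutMix, Large Scale Jitter, and SimpleCopyPaste can be imported directly from TorchVision.
New functional transforms can transform videos, bounding boxes, and segmentation masks.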

Current limitations of Transforms

The stable TorchVision Transforms API, commonly known as Transforms V1, supports only single images and is therefore suitable only for classification tasks:

from torchvision import transforms
trans = transforms.Compose([
   transforms.ColorJitter(contrast=0.5),
   transforms.RandomRotation(30),
   transforms.CenterCrop(480),
])
imgs = trans(imgs)

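This approach does not support object detection, segmentation, or classification transforms that require labels, such as MixUp and CutMix. As a result, computer vision tasks other than classification cannot use the Transforms API to perform the augmentations they need. It also makes it harder to train high-accuracy models with TorchVision primitives.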
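
To see why a label-aware transform does not fit an image-only interface, consider a minimal pure-Python sketch of the MixUp idea: two samples are blended with a weight `lam`, and their one-hot labels have to be blended with the same weight. The helper names `one_hot` and `mixup_pair`, the list-based "images", and the Beta parameters are illustrative assumptions, not TorchVision code.

```python
import random


def one_hot(index, num_classes):
    """Return a one-hot label vector as a list of floats."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def mixup_pair(img_a, label_a, img_b, label_b, lam):
    """Blend two flat images and their one-hot labels with the same weight lam.

    Hypothetical helper for illustration; images are flat lists of pixel values.
    """
    mixed_img = [lam * a + (1.0 - lam) * b for a, b in zip(img_a, img_b)]
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_img, mixed_label


# Two tiny "images" with four pixels each, belonging to classes 0 and 2.
img_a, label_a = [0.0, 0.0, 1.0, 1.0], one_hot(0, 3)
img_b, label_b = [1.0, 1.0, 1.0, 0.0], one_hot(2, 3)

# MixUp typically samples lam from a Beta distribution; alpha = 1.0 is an assumed value.
lam = random.betavariate(1.0, 1.0)
mixed_img, mixed_label = mixup_pair(img_a, label_a, img_b, label_b, lam)
```

Because the output label depends on the same `lam` as the output image, a transform that only ever sees the image has no way to produce a consistent target.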

To work around this limitation, TorchVision provides custom implementations in its reference scripts that demonstrate how augmentations are performed for each task.

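Although this lets developers train high-accuracy classification, object detection, and segmentation models, it is a crude approach, and these transforms still cannot be imported from the TorchVision binaries.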

The new Transforms API

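The Transforms V2 API supports videos, bounding boxes, labels, and segmentation masks, which means it offers native support for many computer vision tasks. The new solution is a drop-in replacement: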

from torchvision.prototype import transforms
# Exactly the same interface as V1:
trans = transforms.Compose([
    transforms.ColorJitter(contrast=0.5),
    transforms.RandomRotation(30),
    transforms.CenterCrop(480),
])
imgs, bboxes, labels = trans(imgs, bboxes, labels)

The new Transform classes accept any number of inputs without enforcing a specific order or structure:

# Already supported:
trans(imgs)  # Image Classification
trans(videos)  # Video Tasks
trans(imgs_or_videos, labels)  # MixUp/CutMix-style Transforms
trans(imgs, bboxes, labels)  # Object Detection
trans(imgs, bboxes, masks, labels)  # Instance Segmentation
trans(imgs, masks)  # Semantic Segmentation
trans({"image": imgs, "box": bboxes, "tag": labels})  # Arbitrary Structure
# Future support:
trans(imgs, bboxes, labels, keypoints)  # Keypoint Detection
trans(stereo_images, disparities, masks)  # Depth Perception
trans(image1, image2, optical_flows, masks)  # Optical Flow
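
One way to picture how a transform can take arbitrary inputs is the pure-Python sketch below: it samples its random decision once per call, then walks any nested dict, list, or tuple and flips each leaf it recognizes, passing everything else through. The classes `Image`, `Boxes`, and `RandomHFlip` and the helper `map_structure` are hypothetical stand-ins for illustration and do not reflect how TorchVision implements this.

```python
import random
from dataclasses import dataclass


@dataclass
class Image:
    rows: list   # list of pixel rows


@dataclass
class Boxes:
    xyxy: list   # list of [x1, y1, x2, y2]
    width: int   # width of the image the boxes refer to


def map_structure(sample, fn):
    """Recursively apply fn to every leaf of a nested dict/list/tuple."""
    if isinstance(sample, dict):
        return {key: map_structure(value, fn) for key, value in sample.items()}
    if isinstance(sample, (list, tuple)):
        return type(sample)(map_structure(value, fn) for value in sample)
    return fn(sample)


class RandomHFlip:
    """Hypothetical joint transform: one random decision shared by all inputs."""

    def __init__(self, p=0.5):
        self.p = p

    def _flip_leaf(self, obj):
        if isinstance(obj, Image):
            return Image([row[::-1] for row in obj.rows])
        if isinstance(obj, Boxes):
            flipped = [[obj.width - x2, y1, obj.width - x1, y2]
                       for x1, y1, x2, y2 in obj.xyxy]
            return Boxes(flipped, obj.width)
        return obj  # labels and other unknown leaves are left unchanged

    def __call__(self, *inputs):
        # Sample once so the image and its boxes are always treated the same way.
        if random.random() < self.p:
            inputs = map_structure(inputs, self._flip_leaf)
        return inputs[0] if len(inputs) == 1 else inputs


img = Image([[1, 2, 3], [4, 5, 6]])
boxes = Boxes([[0, 0, 1, 2]], width=3)
flip = RandomHFlip(p=1.0)
flipped = flip({"image": img, "box": boxes, "tag": ["cat"]})
```

Sampling the decision before touching any input is the key design point: if the image and the boxes were flipped independently, the annotations would no longer line up with the pixels.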

The functional API has been updated to provide the signal-processing kernels needed for every input type, such as resizing, cropping, affine transforms, and padding:

from torchvision.transforms import InterpolationMode
from torchvision.prototype.transforms import functional as F
# High-level dispatcher, accepts any supported input type, fully BC
F.resize(inpt, size=[224, 224])
# Image tensor kernel
F.resize_image_tensor(img_tensor, size=[224, 224], antialias=True)
# PIL image kernel
F.resize_image_pil(img_pil, size=[224, 224], interpolation=InterpolationMode.BILINEAR)
# Video kernel
F.resize_video(video, size=[224, 224], antialias=True)
# Mask kernel
F.resize_mask(mask, size=[224, 224])
# Bounding box kernel
F.resize_bounding_box(bbox, size=[224, 224], spatial_size=[256, 256])
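
The bounding box kernel is the only call above that takes `spatial_size`, and a small sketch suggests why: box coordinates can only be rescaled if the size of the image they refer to is known. The function `resize_xyxy_box` below is a hypothetical pure-Python illustration of that arithmetic, assuming XYXY boxes and (height, width) sizes; it is not the TorchVision kernel.

```python
def resize_xyxy_box(box, spatial_size, size):
    """Scale one XYXY box from an image of spatial_size to an image of size.

    Both sizes are (height, width). Hypothetical helper for illustration.
    """
    old_h, old_w = spatial_size
    new_h, new_w = size
    ratio_h = new_h / old_h
    ratio_w = new_w / old_w
    x1, y1, x2, y2 = box
    # x coordinates scale with the width ratio, y coordinates with the height ratio.
    return [x1 * ratio_w, y1 * ratio_h, x2 * ratio_w, y2 * ratio_h]


box = [32, 64, 128, 256]
resized = resize_xyxy_box(box, spatial_size=(256, 256), size=(224, 224))
print(resized)  # [28.0, 56.0, 112.0, 224.0]
```

An image or mask carries its own height and width, so its kernel needs only the target size; a box is just four numbers, so the source size has to be supplied alongside it.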

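The API uses Tensor subclassing to wrap the inputs, attach useful metadata, and dispatch to the right kernel. Once the Datasets V2 work, which builds on TorchData Data Pipes, is complete, wrapping inputs manually will no longer be necessary. For now, users can wrap inputs manually as follows: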

from torchvision.prototype import features
imgs = features.Image(images, color_space=features.ColorSpace.RGB)
vids = features.Video(videos, color_space=features.ColorSpace.RGB)
masks = features.Mask(target["masks"])
bboxes = features.BoundingBox(target["boxes"], format=features.BoundingBoxFormat.XYXY, spatial_size=imgs.spatial_size)
labels = features.Label(target["labels"], categories=["dog", "cat"])
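
The wrap, attach metadata, and dispatch pattern can be sketched without tensors at all. In the hypothetical example below, subclasses of `list` stand in for tensor subclasses: the wrapper type tells the dispatcher `vflip` which kernel to call, and the `spatial_size` metadata gives the box kernel the image height it needs. All class and function names are assumptions made for illustration.

```python
class Feature(list):
    """Stand-in for a tensor subclass: plain data whose type drives dispatch."""


class Image(Feature):
    pass


class BoundingBox(Feature):
    def __init__(self, data, spatial_size):
        super().__init__(data)
        self.spatial_size = spatial_size  # metadata: (height, width)


def vflip_image(rows):
    """Kernel for images: reverse the order of the pixel rows."""
    return rows[::-1]


def vflip_bounding_box(boxes, spatial_size):
    """Kernel for XYXY boxes: mirror the y coordinates around the image height."""
    height = spatial_size[0]
    return [[x1, height - y2, x2, height - y1] for x1, y1, x2, y2 in boxes]


def vflip(inpt):
    """Dispatcher: choose the kernel from the wrapper type, then re-wrap the result."""
    if isinstance(inpt, BoundingBox):
        flipped = vflip_bounding_box(inpt, inpt.spatial_size)
        return BoundingBox(flipped, inpt.spatial_size)
    if isinstance(inpt, Image):
        return Image(vflip_image(inpt))
    return inpt  # unknown inputs pass through untouched


img = Image([[1, 2], [3, 4], [5, 6]])
boxes = BoundingBox([[0, 0, 2, 1]], spatial_size=(3, 2))
flipped_img = vflip(img)
flipped_boxes = vflip(boxes)
```

Re-wrapping the kernel output keeps the type and metadata attached, so a later transform in a pipeline can dispatch on the result in the same way.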

In addition to the new API, the PyTorch team provides implementations of several data augmentations used in SoTA research, such as MixUp, CutMix, Large Scale Jitter, SimpleCopyPaste, and AutoAugmentation methods, along with several new Geometric, Colour, and Type Conversion transforms.

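The API continues to support both the PIL and Tensor backends for single images and batched inputs, and it keeps the functional API JIT-scriptable. It also allows the cast of images from uint8 to float to be deferred, which can further improve performance.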

It is currently available in TorchVision's prototype area and can be imported from the nightly builds. The new API has been verified to match the accuracy of the previous implementation.

Current limitations

The functional API (kernels) remains JIT-scriptable and fully BC. The Transform classes offer the same interface but cannot be scripted.

This is because the Transform classes use Tensor subclassing and accept an arbitrary number of inputs, neither of which JIT supports. This limitation is expected to be addressed in later releases.

An end-to-end example

Below is an example of the new API that works with both PIL images and Tensors.

Test image (not shown here): COCO_val2014_000000418825.jpg

Code example:

import PIL
from torchvision import io, utils
from torchvision.prototype import features, transforms as T
from torchvision.prototype.transforms import functional as F
# Defining and wrapping input to appropriate Tensor Subclasses
path = "COCO_val2014_000000418825.jpg"
img = features.Image(io.read_image(path), color_space=features.ColorSpace.RGB)
# img = PIL.Image.open(path)
bboxes = features.BoundingBox(
    [[2, 0, 206, 253], [396, 92, 479, 241], [328, 253, 417, 332],
     [148, 68, 256, 182], [93, 158, 170, 260], [432, 0, 438, 26],
     [422, 0, 480, 25], [419, 39, 424, 52], [448, 37, 456, 62],
     [435, 43, 437, 50], [461, 36, 469, 63], [461, 75, 469, 94],
     [469, 36, 480, 64], [440, 37, 446, 56], [398, 233, 480, 304],
     [452, 39, 463, 63], [424, 38, 429, 50]],
    format=features.BoundingBoxFormat.XYXY,
    spatial_size=F.get_spatial_size(img),
)
labels = features.Label([59, 58, 50, 64, 76, 74, 74, 74, 74, 74, 74, 74, 74, 74, 50, 74, 74])
# Defining and applying Transforms V2
trans = T.Compose(
    [
        T.ColorJitter(contrast=0.5),
        T.RandomRotation(30),
        T.CenterCrop(480),
    ]
)
img, bboxes, labels = trans(img, bboxes, labels)
# Visualizing results
viz = utils.draw_bounding_boxes(F.to_image_tensor(img), boxes=bboxes)
F.to_pil_image(viz).show()
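
In the pipeline above, `CenterCrop(480)` has to move the bounding boxes along with the pixels. The sketch below shows one common convention for doing so: shift each box by the crop offset, then clamp it to the cropped area. The function `center_crop_boxes`, the floor-division offsets, the clamping step, and the 600 x 800 image size are assumptions for illustration and may differ from what TorchVision does.

```python
def center_crop_boxes(boxes, spatial_size, crop_size):
    """Shift XYXY boxes into the frame of a square center crop and clamp them.

    spatial_size is (height, width). Hypothetical helper for illustration.
    """
    height, width = spatial_size
    # Offset of the crop window's top-left corner (floor division is an assumption).
    top = (height - crop_size) // 2
    left = (width - crop_size) // 2
    cropped = []
    for x1, y1, x2, y2 in boxes:
        shifted = [x1 - left, y1 - top, x2 - left, y2 - top]
        # Clamp so that boxes partly outside the crop stay within [0, crop_size].
        cropped.append([min(max(value, 0), crop_size) for value in shifted])
    return cropped


boxes = [[100, 50, 300, 200], [200, 100, 700, 580]]
cropped = center_crop_boxes(boxes, spatial_size=(600, 800), crop_size=480)
print(cropped)  # [[0, 0, 140, 140], [40, 40, 480, 480]]
```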
