Building a Mask R-CNN Instance Segmentation Platform with Keras: Implementation and Source Code
What Is Mask R-CNN
Let's take a look at how the impressive Mask R-CNN performs instance segmentation; the underlying idea is quite interesting!
Mask R-CNN was proposed by Kaiming He in 2017. It performs instance segmentation alongside object detection and achieves excellent results.
The network design is fairly simple: starting from Faster R-CNN, a third branch that predicts a segmentation mask is added next to the original two branches (classification and box regression).
Mask R-CNN Implementation Overview
I. Inference
1. The Backbone Network
Mask R-CNN uses ResNet101 as its backbone feature-extraction network, corresponding to the CNN part of the diagram. The height and width of the input image must be divisible by 2^6 (i.e. by 64). After feature extraction, the feature layers whose height and width have been downsampled 2, 3, 4, and 5 times are used to build the feature pyramid.
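As a quick illustration of the size requirement and of the feature-layer sizes used below, here is a small sanity-check snippet (not part of the original code):

# The input side lengths must be divisible by 2**6 = 64; C2-C5 are the
# feature maps downsampled by a factor of 4, 8, 16 and 32 respectively.
size = 1024
assert size % 64 == 0, "input height/width must be divisible by 64"
for name, stride in zip(["C2", "C3", "C4", "C5"], [4, 8, 16, 32]):
    print(name, size // stride)   # 256, 128, 64, 32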
ResNet101 is built from two basic blocks, the Conv Block and the Identity Block. The Conv Block has different input and output dimensions, so it cannot be stacked back to back; its role is to change the dimensions of the network. The Identity Block has identical input and output dimensions, so it can be stacked repeatedly to deepen the network.
The structure of the Conv Block is as follows:
The structure of the Identity Block is as follows:
Both are residual structures.
Taking the input shape used with the official COCO dataset as an example, the input shape is 1024x1024 and it changes as follows:
We take the outputs whose height and width have been downsampled 2, 3, 4, and 5 times to build the feature pyramid.
Implementation code:
from keras.layers import ZeroPadding2D, Conv2D, MaxPooling2D, BatchNormalization, Activation, Add

def identity_block(input_tensor, kernel_size, filters, stage, block,
                   use_bias=True, train_bn=True):
    nb_filter1, nb_filter2, nb_filter3 = filters
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    x = Conv2D(nb_filter1, (1, 1), name=conv_name_base + '2a', use_bias=use_bias)(input_tensor)
    x = BatchNormalization(name=bn_name_base + '2a')(x, training=train_bn)
    x = Activation('relu')(x)

    x = Conv2D(nb_filter2, (kernel_size, kernel_size), padding='same',
               name=conv_name_base + '2b', use_bias=use_bias)(x)
    x = BatchNormalization(name=bn_name_base + '2b')(x, training=train_bn)
    x = Activation('relu')(x)

    x = Conv2D(nb_filter3, (1, 1), name=conv_name_base + '2c', use_bias=use_bias)(x)
    x = BatchNormalization(name=bn_name_base + '2c')(x, training=train_bn)

    x = Add()([x, input_tensor])
    x = Activation('relu', name='res' + str(stage) + block + '_out')(x)
    return x

def conv_block(input_tensor, kernel_size, filters, stage, block,
               strides=(2, 2), use_bias=True, train_bn=True):
    nb_filter1, nb_filter2, nb_filter3 = filters
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    x = Conv2D(nb_filter1, (1, 1), strides=strides,
               name=conv_name_base + '2a', use_bias=use_bias)(input_tensor)
    x = BatchNormalization(name=bn_name_base + '2a')(x, training=train_bn)
    x = Activation('relu')(x)

    x = Conv2D(nb_filter2, (kernel_size, kernel_size), padding='same',
               name=conv_name_base + '2b', use_bias=use_bias)(x)
    x = BatchNormalization(name=bn_name_base + '2b')(x, training=train_bn)
    x = Activation('relu')(x)

    x = Conv2D(nb_filter3, (1, 1), name=conv_name_base + '2c', use_bias=use_bias)(x)
    x = BatchNormalization(name=bn_name_base + '2c')(x, training=train_bn)

    shortcut = Conv2D(nb_filter3, (1, 1), strides=strides,
                      name=conv_name_base + '1', use_bias=use_bias)(input_tensor)
    shortcut = BatchNormalization(name=bn_name_base + '1')(shortcut, training=train_bn)

    x = Add()([x, shortcut])
    x = Activation('relu', name='res' + str(stage) + block + '_out')(x)
    return x

def get_resnet(input_image, stage5=False, train_bn=True):
    # Stage 1
    x = ZeroPadding2D((3, 3))(input_image)
    x = Conv2D(64, (7, 7), strides=(2, 2), name='conv1', use_bias=True)(x)
    x = BatchNormalization(name='bn_conv1')(x, training=train_bn)
    x = Activation('relu')(x)
    # Height/4, Width/4, 64
    C1 = x = MaxPooling2D((3, 3), strides=(2, 2), padding="same")(x)
    # Stage 2
    x = conv_block(x, 3, [64, 64, 256], stage=2, block='a', strides=(1, 1), train_bn=train_bn)
    x = identity_block(x, 3, [64, 64, 256], stage=2, block='b', train_bn=train_bn)
    # Height/4, Width/4, 256
    C2 = x = identity_block(x, 3, [64, 64, 256], stage=2, block='c', train_bn=train_bn)
    # Stage 3
    x = conv_block(x, 3, [128, 128, 512], stage=3, block='a', train_bn=train_bn)
    x = identity_block(x, 3, [128, 128, 512], stage=3, block='b', train_bn=train_bn)
    x = identity_block(x, 3, [128, 128, 512], stage=3, block='c', train_bn=train_bn)
    # Height/8, Width/8, 512
    C3 = x = identity_block(x, 3, [128, 128, 512], stage=3, block='d', train_bn=train_bn)
    # Stage 4
    x = conv_block(x, 3, [256, 256, 1024], stage=4, block='a', train_bn=train_bn)
    block_count = 22
    for i in range(block_count):
        x = identity_block(x, 3, [256, 256, 1024], stage=4, block=chr(98 + i), train_bn=train_bn)
    # Height/16, Width/16, 1024
    C4 = x
    # Stage 5
    if stage5:
        x = conv_block(x, 3, [512, 512, 2048], stage=5, block='a', train_bn=train_bn)
        x = identity_block(x, 3, [512, 512, 2048], stage=5, block='b', train_bn=train_bn)
        # Height/32, Width/32, 2048
        C5 = x = identity_block(x, 3, [512, 512, 2048], stage=5, block='c', train_bn=train_bn)
    else:
        C5 = None
    return [C1, C2, C3, C4, C5]
2. Building the Feature Pyramid (FPN)
The feature pyramid (FPN) is built to fuse features at multiple scales. In Mask R-CNN we take the backbone outputs whose height and width have been downsampled 2 times (C2), 3 times (C3), 4 times (C4), and 5 times (C5) to construct the feature pyramid.
The resulting P2, P3, P4, P5, and P6 serve as the effective feature layers of the RPN. The RPN operates on these layers and decodes the anchors (prior boxes) to obtain proposals.
P2, P3, P4, and P5 also serve as the effective feature layers of the Classifier and Mask networks. The Classifier network operates on these layers and decodes the proposals to obtain the final predicted boxes; the Mask network operates on them to obtain the segmentation result inside each predicted box.
The implementation code is as follows:
# Obtain layers at different compression depths from ResNet
_, C2, C3, C4, C5 = get_resnet(input_image, stage5=True, train_bn=config.TRAIN_BN)
# Assemble the feature pyramid
# P5: height and width compressed 5 times
# Height/32, Width/32, 256
P5 = Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c5p5')(C5)
# P4: height and width compressed 4 times
# Height/16, Width/16, 256
P4 = Add(name="fpn_p4add")([
    UpSampling2D(size=(2, 2), name="fpn_p5upsampled")(P5),
    Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c4p4')(C4)])
# P3: height and width compressed 3 times
# Height/8, Width/8, 256
P3 = Add(name="fpn_p3add")([
    UpSampling2D(size=(2, 2), name="fpn_p4upsampled")(P4),
    Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c3p3')(C3)])
# P2: height and width compressed 2 times
# Height/4, Width/4, 256
P2 = Add(name="fpn_p2add")([
    UpSampling2D(size=(2, 2), name="fpn_p3upsampled")(P3),
    Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c2p2')(C2)])
# Apply a 3x3 convolution with 256 channels to each level; P2-P5 now have the same channel count
# Height/4, Width/4, 256
P2 = Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p2")(P2)
# Height/8, Width/8, 256
P3 = Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p3")(P3)
# Height/16, Width/16, 256
P4 = Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p4")(P4)
# Height/32, Width/32, 256
P5 = Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p5")(P5)
# The proposal network also uses an extra P6 level
# Height/64, Width/64, 256
P6 = MaxPooling2D(pool_size=(1, 1), strides=2, name="fpn_p6")(P5)
# P2, P3, P4, P5, P6 are used to generate proposals
rpn_feature_maps = [P2, P3, P4, P5, P6]
# P2, P3, P4, P5 are used to obtain the mask information
mrcnn_feature_maps = [P2, P3, P4, P5]
3. Obtaining the Proposals
The effective feature layers obtained in the previous step are the feature maps in the diagram. They are used in two ways: they are combined with ROI Align, and they are fed into the Region Proposal Network to generate proposals.
When generating proposals, the effective feature layers used are P2, P3, P4, P5, and P6. They share a single RPN, which predicts the adjustment parameters of the anchors (prior boxes) as well as whether each anchor contains an object.
In Mask R-CNN, the structure of the RPN is similar to the RPN in Faster R-CNN.
First, a 3x3 convolution with 512 channels is applied.
Then an anchors_per_location x 4 convolution and an anchors_per_location x 2 convolution are applied separately.
The anchors_per_location x 4 convolution predicts, for every anchor at every grid point of the shared feature layer, how that anchor should be adjusted. (They are called adjustments because, as in Faster R-CNN, the predictions have to be combined with the anchors to obtain the predicted boxes; the network predicts offsets relative to the anchors.)
The anchors_per_location x 2 convolution predicts, for every anchor at every grid point of the shared feature layer, whether the corresponding box contains an object.
When the input image shape is 1024x1024x3, the shared feature layers have shapes 256x256x256, 128x128x256, 64x64x256, 32x32x256, and 16x16x256. This is equivalent to dividing the input image into grids of different sizes, where by default every grid point has 3 (anchors_per_location) anchors of different sizes that densely cover the image.
The result of the anchors_per_location x 4 convolution adjusts these anchors to obtain new boxes, and the anchors_per_location x 2 convolution judges whether those new boxes contain an object.
At this point we have some useful boxes, scored by the anchors_per_location x 2 convolution as containing an object or not.
These are still only rough boxes, i.e. proposals; we will keep looking for objects inside them.
The implementation code is:
#------------------------------------#
#   The five feature layers of different
#   sizes are passed to the RPN to
#   obtain proposals
#------------------------------------#
def rpn_graph(feature_map, anchors_per_location):
    shared = Conv2D(512, (3, 3), padding='same', activation='relu',
                    name='rpn_conv_shared')(feature_map)

    x = Conv2D(2 * anchors_per_location, (1, 1), padding='valid',
               activation='linear', name='rpn_class_raw')(shared)
    # batch_size, num_anchors, 2
    # the class (object or background) of each anchor
    rpn_class_logits = Reshape([-1, 2])(x)
    rpn_probs = Activation("softmax", name="rpn_class_xxx")(rpn_class_logits)

    x = Conv2D(anchors_per_location * 4, (1, 1), padding="valid",
               activation='linear', name='rpn_bbox_pred')(shared)
    # batch_size, num_anchors, 4
    # the adjustment parameters of each anchor
    rpn_bbox = Reshape([-1, 4])(x)

    return [rpn_class_logits, rpn_probs, rpn_bbox]

#------------------------------------#
#   Build the proposal network
#   (the RPN model)
#------------------------------------#
def build_rpn_model(anchors_per_location, depth):
    input_feature_map = Input(shape=[None, None, depth],
                              name="input_rpn_feature_map")
    outputs = rpn_graph(input_feature_map, anchors_per_location)
    return Model([input_feature_map], outputs, name="rpn_model")
4. Decoding the Proposals
From the previous step we obtained predictions for a large number of anchors (prior boxes). The predictions consist of two parts.
The anchors_per_location x 4 convolution predicts, for every anchor at every grid point of the effective feature layer, how that anchor should be adjusted.
The anchors_per_location x 2 convolution predicts, for every anchor at every grid point of the effective feature layer, whether the corresponding box contains an object.
This is equivalent to dividing the whole image into grids and placing 3 anchors at the center of every grid cell. When the input image is 1024x1024x3, the total number of anchors is 196608 + 49152 + 12288 + 3072 + 768 = 261,888.
When the input image shape changes, the number of anchors changes as well.
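As a quick check, the per-level anchor counts above follow directly from the P2-P6 feature-map sizes; a small illustrative snippet (not part of the original code):

# Anchor count for a 1024x1024 input with 3 anchors per grid point
anchors_per_location = 3
strides = [4, 8, 16, 32, 64]                        # strides of P2-P6
level_sizes = [1024 // s for s in strides]          # 256, 128, 64, 32, 16
level_counts = [n * n * anchors_per_location for n in level_sizes]
print(level_counts)        # [196608, 49152, 12288, 3072, 768]
print(sum(level_counts))   # 261888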
Although the anchors encode some information about box position and size, they are finite and cannot represent arbitrary boxes, so they still need to be adjusted.
In anchors_per_location x 4, anchors_per_location is the number of anchors at each grid point, and the 4 values describe the adjustment of the box center and of its height and width.
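To make the meaning of those 4 values concrete, the sketch below applies one set of deltas (dy, dx, log(dh), log(dw)) to a single anchor in normalized coordinates. The numbers are made up purely for illustration; the actual batched TensorFlow version is apply_box_deltas_graph in the code below:

import numpy as np

def decode_single_anchor(anchor, deltas):
    # anchor and result are (y1, x1, y2, x2); deltas are (dy, dx, log(dh), log(dw))
    y1, x1, y2, x2 = anchor
    h, w = y2 - y1, x2 - x1
    center_y, center_x = y1 + 0.5 * h, x1 + 0.5 * w
    # shift the center by a fraction of the anchor size, then rescale height and width
    center_y += deltas[0] * h
    center_x += deltas[1] * w
    h *= np.exp(deltas[2])
    w *= np.exp(deltas[3])
    return np.array([center_y - 0.5 * h, center_x - 0.5 * w,
                     center_y + 0.5 * h, center_x + 0.5 * w])

# hypothetical anchor and predicted deltas, for illustration only
print(decode_single_anchor(np.array([0.2, 0.2, 0.4, 0.4]),
                           np.array([0.1, -0.1, 0.2, 0.0])))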
The implementation code is as follows:
#----------------------------------------------------------# # Proposal Layer # 該部分代碼用于將先驗框轉(zhuǎn)化成建議框 #----------------------------------------------------------# def apply_box_deltas_graph(boxes, deltas): # 計算先驗框的中心和寬高 height = boxes[:, 2] - boxes[:, 0] width = boxes[:, 3] - boxes[:, 1] center_y = boxes[:, 0] + 0.5 * height center_x = boxes[:, 1] + 0.5 * width # 計算出調(diào)整后的先驗框的中心和寬高 center_y += deltas[:, 0] * height center_x += deltas[:, 1] * width height *= tf.exp(deltas[:, 2]) width *= tf.exp(deltas[:, 3]) # 計算左上角和右下角的點的坐標(biāo) y1 = center_y - 0.5 * height x1 = center_x - 0.5 * width y2 = y1 + height x2 = x1 + width result = tf.stack([y1, x1, y2, x2], axis=1, name="apply_box_deltas_out") return result def clip_boxes_graph(boxes, window): """ boxes: [N, (y1, x1, y2, x2)] window: [4] in the form y1, x1, y2, x2 """ # Split wy1, wx1, wy2, wx2 = tf.split(window, 4) y1, x1, y2, x2 = tf.split(boxes, 4, axis=1) # Clip y1 = tf.maximum(tf.minimum(y1, wy2), wy1) x1 = tf.maximum(tf.minimum(x1, wx2), wx1) y2 = tf.maximum(tf.minimum(y2, wy2), wy1) x2 = tf.maximum(tf.minimum(x2, wx2), wx1) clipped = tf.concat([y1, x1, y2, x2], axis=1, name="clipped_boxes") clipped.set_shape((clipped.shape[0], 4)) return clipped class ProposalLayer(Layer): def __init__(self, proposal_count, nms_threshold, config=None, **kwargs): super(ProposalLayer, self).__init__(**kwargs) self.config = config self.proposal_count = proposal_count self.nms_threshold = nms_threshold # [rpn_class, rpn_bbox, anchors] def call(self, inputs): # 代表這個先驗框內(nèi)部是否有物體[batch, num_rois, 1] scores = inputs[0][:, :, 1] # 代表這個先驗框的調(diào)整參數(shù)[batch, num_rois, 4] deltas = inputs[1] # [0.1 0.1 0.2 0.2],改變數(shù)量級 deltas = deltas * np.reshape(self.config.RPN_BBOX_STD_DEV, [1, 1, 4]) # Anchors anchors = inputs[2] # 篩選出得分前6000個的框 pre_nms_limit = tf.minimum(self.config.PRE_NMS_LIMIT, tf.shape(anchors)[1]) # 獲得這些框的索引 ix = tf.nn.top_k(scores, pre_nms_limit, sorted=True, name="top_anchors").indices # 獲得這些框的得分 scores = utils.batch_slice([scores, ix], lambda x, y: tf.gather(x, y), self.config.IMAGES_PER_GPU) # 獲得這些框的調(diào)整參數(shù) deltas = utils.batch_slice([deltas, ix], lambda x, y: tf.gather(x, y), self.config.IMAGES_PER_GPU) # 獲得這些框?qū)?yīng)的先驗框 pre_nms_anchors = utils.batch_slice([anchors, ix], lambda a, x: tf.gather(a, x), self.config.IMAGES_PER_GPU, names=["pre_nms_anchors"]) # [batch, N, (y1, x1, y2, x2)] # 對先驗框進(jìn)行解碼 boxes = utils.batch_slice([pre_nms_anchors, deltas], lambda x, y: apply_box_deltas_graph(x, y), self.config.IMAGES_PER_GPU, names=["refined_anchors"]) # [batch, N, (y1, x1, y2, x2)] # 防止超出圖片范圍 window = np.array([0, 0, 1, 1], dtype=np.float32) boxes = utils.batch_slice(boxes, lambda x: clip_boxes_graph(x, window), self.config.IMAGES_PER_GPU, names=["refined_anchors_clipped"]) # 非極大抑制 def nms(boxes, scores): indices = tf.image.non_max_suppression( boxes, scores, self.proposal_count, self.nms_threshold, name="rpn_non_max_suppression") proposals = tf.gather(boxes, indices) # 如果數(shù)量達(dá)不到設(shè)置的建議框數(shù)量的話 # 就padding padding = tf.maximum(self.proposal_count - tf.shape(proposals)[0], 0) proposals = tf.pad(proposals, [(0, padding), (0, 0)]) return proposals proposals = utils.batch_slice([boxes, scores], nms, self.config.IMAGES_PER_GPU) return proposals def compute_output_shape(self, input_shape): return (None, self.proposal_count, 4)
5. Making Use of the Proposals (ROI Align)
Let's build an overall picture of the proposals: a proposal is essentially a preliminary screening of which regions of the image contain objects.
Concretely, what Mask R-CNN does up to this point is: the backbone feature-extraction network produces several shared feature layers, and the proposals are then used to crop regions out of those shared feature layers.
Each point of a shared feature layer is essentially a condensed representation of all the features in some region of the original image.
Each proposal crops its corresponding shared feature layer, and the cropped result is resized: in the classifier model it is resized to 7x7x256, and in the mask model it is resized to 14x14x256.
When cropping the shared feature layers with a proposal, we first have to determine which feature layer the proposal belongs to, and this is decided from the size of the proposal.
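In the PyramidROIAlign code further down, this assignment follows the FPN rule level = 4 + log2(sqrt(h*w) / (224 / sqrt(image_area))), clamped to the range [2, 5]. A small sketch of that rule (illustrative only):

import numpy as np

def roi_level(box, image_shape=(1024, 1024)):
    # box is (y1, x1, y2, x2) in normalized coordinates; returns the pyramid level (2-5)
    y1, x1, y2, x2 = box
    h, w = y2 - y1, x2 - x1
    image_area = image_shape[0] * image_shape[1]
    level = 4 + np.log2(np.sqrt(h * w) / (224.0 / np.sqrt(image_area)))
    return int(np.clip(np.round(level), 2, 5))

# a box covering roughly 224x224 pixels of a 1024x1024 image is assigned to P4
print(roi_level((0.0, 0.0, 224 / 1024, 224 / 1024)))   # 4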
In the classifier model, the 7x7x256 region produced by ROI Align is passed through a 7x7 convolution with 1024 channels and then a 1x1 convolution with 1024 channels; these two 1024-channel convolutions emulate two 1024-unit fully connected layers. The output is then fully connected to num_classes and to num_classes x 4 units, representing the class of the object inside the proposal and the adjustment parameters of the proposal, respectively.
In the mask model, the resized local feature layer is first passed through four 3x3 convolutions with 256 channels, then a transposed convolution, and finally a convolution with num_classes channels that classifies every pixel. The final shape is 28x28xnum_classes, giving the class of each pixel.
#------------------------------------# # 五個不同大小的特征層會傳入到 # RPN當(dāng)中,獲得建議框 #------------------------------------# def rpn_graph(feature_map, anchors_per_location): shared = Conv2D(512, (3, 3), padding='same', activation='relu', name='rpn_conv_shared')(feature_map) x = Conv2D(2 * anchors_per_location, (1, 1), padding='valid', activation='linear', name='rpn_class_raw')(shared) # batch_size,num_anchors,2 # 代表這個先驗框?qū)?yīng)的類 rpn_class_logits = Reshape([-1,2])(x) rpn_probs = Activation( "softmax", name="rpn_class_xxx")(rpn_class_logits) x = Conv2D(anchors_per_location * 4, (1, 1), padding="valid", activation='linear', name='rpn_bbox_pred')(shared) # batch_size,num_anchors,4 # 這個先驗框的調(diào)整參數(shù) rpn_bbox = Reshape([-1,4])(x) return [rpn_class_logits, rpn_probs, rpn_bbox] #------------------------------------# # 建立建議框網(wǎng)絡(luò)模型 # RPN模型 #------------------------------------# def build_rpn_model(anchors_per_location, depth): input_feature_map = Input(shape=[None, None, depth], name="input_rpn_feature_map") outputs = rpn_graph(input_feature_map, anchors_per_location) return Model([input_feature_map], outputs, name="rpn_model") #------------------------------------# # 建立classifier模型 # 這個模型的預(yù)測結(jié)果會調(diào)整建議框 # 獲得最終的預(yù)測框 #------------------------------------# def fpn_classifier_graph(rois, feature_maps, image_meta, pool_size, num_classes, train_bn=True, fc_layers_size=1024): # ROI Pooling,利用建議框在特征層上進(jìn)行截取 # Shape: [batch, num_rois, POOL_SIZE, POOL_SIZE, channels] x = PyramidROIAlign([pool_size, pool_size], name="roi_align_classifier")([rois, image_meta] + feature_maps) # Shape: [batch, num_rois, 1, 1, fc_layers_size],相當(dāng)于兩次全連接 x = TimeDistributed(Conv2D(fc_layers_size, (pool_size, pool_size), padding="valid"), name="mrcnn_class_conv1")(x) x = TimeDistributed(BatchNormalization(), name='mrcnn_class_bn1')(x, training=train_bn) x = Activation('relu')(x) # Shape: [batch, num_rois, 1, 1, fc_layers_size] x = TimeDistributed(Conv2D(fc_layers_size, (1, 1)), name="mrcnn_class_conv2")(x) x = TimeDistributed(BatchNormalization(), name='mrcnn_class_bn2')(x, training=train_bn) x = Activation('relu')(x) # Shape: [batch, num_rois, fc_layers_size] shared = Lambda(lambda x: K.squeeze(K.squeeze(x, 3), 2), name="pool_squeeze")(x) # Classifier head # 這個的預(yù)測結(jié)果代表這個先驗框內(nèi)部的物體的種類 mrcnn_class_logits = TimeDistributed(Dense(num_classes), name='mrcnn_class_logits')(shared) mrcnn_probs = TimeDistributed(Activation("softmax"), name="mrcnn_class")(mrcnn_class_logits) # BBox head # 這個的預(yù)測結(jié)果會對先驗框進(jìn)行調(diào)整 # [batch, num_rois, NUM_CLASSES * (dy, dx, log(dh), log(dw))] x = TimeDistributed(Dense(num_classes * 4, activation='linear'), name='mrcnn_bbox_fc')(shared) # Reshape to [batch, num_rois, NUM_CLASSES, (dy, dx, log(dh), log(dw))] mrcnn_bbox = Reshape((-1, num_classes, 4), name="mrcnn_bbox")(x) return mrcnn_class_logits, mrcnn_probs, mrcnn_bbox def build_fpn_mask_graph(rois, feature_maps, image_meta, pool_size, num_classes, train_bn=True): # ROI Pooling,利用建議框在特征層上進(jìn)行截取 # Shape: [batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, channels] x = PyramidROIAlign([pool_size, pool_size], name="roi_align_mask")([rois, image_meta] + feature_maps) # Shape: [batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, channels] x = TimeDistributed(Conv2D(256, (3, 3), padding="same"), name="mrcnn_mask_conv1")(x) x = TimeDistributed(BatchNormalization(), name='mrcnn_mask_bn1')(x, training=train_bn) x = Activation('relu')(x) # Shape: [batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, channels] x = 
TimeDistributed(Conv2D(256, (3, 3), padding="same"), name="mrcnn_mask_conv2")(x) x = TimeDistributed(BatchNormalization(), name='mrcnn_mask_bn2')(x, training=train_bn) x = Activation('relu')(x) # Shape: [batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, channels] x = TimeDistributed(Conv2D(256, (3, 3), padding="same"), name="mrcnn_mask_conv3")(x) x = TimeDistributed(BatchNormalization(), name='mrcnn_mask_bn3')(x, training=train_bn) x = Activation('relu')(x) # Shape: [batch, num_rois, MASK_POOL_SIZE, MASK_POOL_SIZE, channels] x = TimeDistributed(Conv2D(256, (3, 3), padding="same"), name="mrcnn_mask_conv4")(x) x = TimeDistributed(BatchNormalization(), name='mrcnn_mask_bn4')(x, training=train_bn) x = Activation('relu')(x) # Shape: [batch, num_rois, 2xMASK_POOL_SIZE, 2xMASK_POOL_SIZE, channels] x = TimeDistributed(Conv2DTranspose(256, (2, 2), strides=2, activation="relu"), name="mrcnn_mask_deconv")(x) # 反卷積后再次進(jìn)行一個1x1卷積調(diào)整通道,使其最終數(shù)量為numclasses,代表分的類 x = TimeDistributed(Conv2D(num_classes, (1, 1), strides=1, activation="sigmoid"), name="mrcnn_mask")(x) return x #----------------------------------------------------------# # ROIAlign Layer # 利用建議框在特征層上截取內(nèi)容 #----------------------------------------------------------# def log2_graph(x): return tf.log(x) / tf.log(2.0) def parse_image_meta_graph(meta): """ 將meta里面的參數(shù)進(jìn)行分割 """ image_id = meta[:, 0] original_image_shape = meta[:, 1:4] image_shape = meta[:, 4:7] window = meta[:, 7:11] # (y1, x1, y2, x2) window of image in in pixels scale = meta[:, 11] active_class_ids = meta[:, 12:] return { "image_id": image_id, "original_image_shape": original_image_shape, "image_shape": image_shape, "window": window, "scale": scale, "active_class_ids": active_class_ids, } class PyramidROIAlign(Layer): def __init__(self, pool_shape, **kwargs): super(PyramidROIAlign, self).__init__(**kwargs) self.pool_shape = tuple(pool_shape) def call(self, inputs): # 建議框的位置 boxes = inputs[0] # image_meta包含了一些必要的圖片信息 image_meta = inputs[1] # 取出所有的特征層[batch, height, width, channels] feature_maps = inputs[2:] y1, x1, y2, x2 = tf.split(boxes, 4, axis=2) h = y2 - y1 w = x2 - x1 # 獲得輸入進(jìn)來的圖像的大小 image_shape = parse_image_meta_graph(image_meta)['image_shape'][0] # 通過建議框的大小找到這個建議框?qū)儆谀膫€特征層 image_area = tf.cast(image_shape[0] * image_shape[1], tf.float32) roi_level = log2_graph(tf.sqrt(h * w) / (224.0 / tf.sqrt(image_area))) roi_level = tf.minimum(5, tf.maximum( 2, 4 + tf.cast(tf.round(roi_level), tf.int32))) # batch_size, box_num roi_level = tf.squeeze(roi_level, 2) # Loop through levels and apply ROI pooling to each. P2 to P5. 
pooled = [] box_to_level = [] # 分別在P2-P5中進(jìn)行截取 for i, level in enumerate(range(2, 6)): # 找到每個特征層對應(yīng)box ix = tf.where(tf.equal(roi_level, level)) level_boxes = tf.gather_nd(boxes, ix) box_to_level.append(ix) # 獲得這些box所屬的圖片 box_indices = tf.cast(ix[:, 0], tf.int32) # 停止梯度下降 level_boxes = tf.stop_gradient(level_boxes) box_indices = tf.stop_gradient(box_indices) # Result: [batch * num_boxes, pool_height, pool_width, channels] pooled.append(tf.image.crop_and_resize( feature_maps[i], level_boxes, box_indices, self.pool_shape, method="bilinear")) pooled = tf.concat(pooled, axis=0) # 將順序和所屬的圖片進(jìn)行堆疊 box_to_level = tf.concat(box_to_level, axis=0) box_range = tf.expand_dims(tf.range(tf.shape(box_to_level)[0]), 1) box_to_level = tf.concat([tf.cast(box_to_level, tf.int32), box_range], axis=1) # box_to_level[:, 0]表示第幾張圖 # box_to_level[:, 1]表示第幾張圖里的第幾個框 sorting_tensor = box_to_level[:, 0] * 100000 + box_to_level[:, 1] # 進(jìn)行排序,將同一張圖里的某一些聚集在一起 ix = tf.nn.top_k(sorting_tensor, k=tf.shape( box_to_level)[0]).indices[::-1] # 按順序獲得圖片的索引 ix = tf.gather(box_to_level[:, 2], ix) pooled = tf.gather(pooled, ix) # 重新reshape為原來的格式 # 也就是 # Shape: [batch, num_rois, POOL_SIZE, POOL_SIZE, channels] shape = tf.concat([tf.shape(boxes)[:2], tf.shape(pooled)[1:]], axis=0) pooled = tf.reshape(pooled, shape) return pooled def compute_output_shape(self, input_shape): return input_shape[0][:2] + self.pool_shape + (input_shape[2][-1], )
6. Decoding the Predicted Boxes
The proposals obtained in step 4 also represent regions of the image, and in the classifier model they play the role of anchors (prior boxes).
That is, the classifier model's predictions represent the class of the object inside each proposal and the adjustment parameters for that proposal.
The adjusted proposals are the final predicted boxes, and these can be drawn directly on the image.
Decoding the predicted boxes involves the following steps:
1. Keep the proposals that are not background and whose score is greater than config.DETECTION_MIN_CONFIDENCE.
2. Decode the proposals with the classifier model's predictions to obtain the positions of the final predicted boxes.
3. Apply non-maximum suppression using the scores and the final box positions to prevent duplicate detections.
The code for decoding the proposals is as follows:
#----------------------------------------------------------# # Detection Layer #----------------------------------------------------------# def refine_detections_graph(rois, probs, deltas, window, config): """細(xì)化分類建議并過濾重疊部分并返回最終結(jié)果探測。 Inputs: rois: [N, (y1, x1, y2, x2)] in normalized coordinates probs: [N, num_classes]. Class probabilities. deltas: [N, num_classes, (dy, dx, log(dh), log(dw))]. Class-specific bounding box deltas. window: (y1, x1, y2, x2) in normalized coordinates. The part of the image that contains the image excluding the padding. Returns detections shaped: [num_detections, (y1, x1, y2, x2, class_id, score)] where coordinates are normalized. """ # 找到得分最高的類 class_ids = tf.argmax(probs, axis=1, output_type=tf.int32) # 序號+類 indices = tf.stack([tf.range(probs.shape[0]), class_ids], axis=1) # 取出成績 class_scores = tf.gather_nd(probs, indices) # 還有框的調(diào)整參數(shù) deltas_specific = tf.gather_nd(deltas, indices) # 進(jìn)行解碼 # Shape: [boxes, (y1, x1, y2, x2)] in normalized coordinates refined_rois = apply_box_deltas_graph( rois, deltas_specific * config.BBOX_STD_DEV) # 防止超出0-1 refined_rois = clip_boxes_graph(refined_rois, window) # 去除背景 keep = tf.where(class_ids > 0)[:, 0] # 去除背景和得分小的區(qū)域 if config.DETECTION_MIN_CONFIDENCE: conf_keep = tf.where(class_scores >= config.DETECTION_MIN_CONFIDENCE)[:, 0] keep = tf.sets.set_intersection(tf.expand_dims(keep, 0), tf.expand_dims(conf_keep, 0)) keep = tf.sparse_tensor_to_dense(keep)[0] # 獲得除去背景并且得分較高的框還有種類與得分 # 1. Prepare variables pre_nms_class_ids = tf.gather(class_ids, keep) pre_nms_scores = tf.gather(class_scores, keep) pre_nms_rois = tf.gather(refined_rois, keep) unique_pre_nms_class_ids = tf.unique(pre_nms_class_ids)[0] def nms_keep_map(class_id): ixs = tf.where(tf.equal(pre_nms_class_ids, class_id))[:, 0] class_keep = tf.image.non_max_suppression( tf.gather(pre_nms_rois, ixs), tf.gather(pre_nms_scores, ixs), max_output_size=config.DETECTION_MAX_INSTANCES, iou_threshold=config.DETECTION_NMS_THRESHOLD) class_keep = tf.gather(keep, tf.gather(ixs, class_keep)) gap = config.DETECTION_MAX_INSTANCES - tf.shape(class_keep)[0] class_keep = tf.pad(class_keep, [(0, gap)], mode='CONSTANT', constant_values=-1) class_keep.set_shape([config.DETECTION_MAX_INSTANCES]) return class_keep # 2. 進(jìn)行非極大抑制 nms_keep = tf.map_fn(nms_keep_map, unique_pre_nms_class_ids, dtype=tf.int64) # 3. 找到符合要求的需要被保留的建議框 nms_keep = tf.reshape(nms_keep, [-1]) nms_keep = tf.gather(nms_keep, tf.where(nms_keep > -1)[:, 0]) # 4. 
Compute intersection between keep and nms_keep keep = tf.sets.set_intersection(tf.expand_dims(keep, 0), tf.expand_dims(nms_keep, 0)) keep = tf.sparse_tensor_to_dense(keep)[0] # 尋找得分最高的num_keep個框 roi_count = config.DETECTION_MAX_INSTANCES class_scores_keep = tf.gather(class_scores, keep) num_keep = tf.minimum(tf.shape(class_scores_keep)[0], roi_count) top_ids = tf.nn.top_k(class_scores_keep, k=num_keep, sorted=True)[1] keep = tf.gather(keep, top_ids) # Arrange output as [N, (y1, x1, y2, x2, class_id, score)] detections = tf.concat([ tf.gather(refined_rois, keep), tf.to_float(tf.gather(class_ids, keep))[..., tf.newaxis], tf.gather(class_scores, keep)[..., tf.newaxis] ], axis=1) # 如果達(dá)不到數(shù)量的話就padding gap = config.DETECTION_MAX_INSTANCES - tf.shape(detections)[0] detections = tf.pad(detections, [(0, gap), (0, 0)], "CONSTANT") return detections def norm_boxes_graph(boxes, shape): h, w = tf.split(tf.cast(shape, tf.float32), 2) scale = tf.concat([h, w, h, w], axis=-1) - tf.constant(1.0) shift = tf.constant([0., 0., 1., 1.]) return tf.divide(boxes - shift, scale) class DetectionLayer(Layer): def __init__(self, config=None, **kwargs): super(DetectionLayer, self).__init__(**kwargs) self.config = config def call(self, inputs): rois = inputs[0] mrcnn_class = inputs[1] mrcnn_bbox = inputs[2] image_meta = inputs[3] # 找到window的小數(shù)形式 m = parse_image_meta_graph(image_meta) image_shape = m['image_shape'][0] window = norm_boxes_graph(m['window'], image_shape[:2]) # Run detection refinement graph on each item in the batch detections_batch = utils.batch_slice( [rois, mrcnn_class, mrcnn_bbox, window], lambda x, y, w, z: refine_detections_graph(x, y, w, z, self.config), self.config.IMAGES_PER_GPU) # Reshape output # [batch, num_detections, (y1, x1, y2, x2, class_id, class_score)] in # normalized coordinates return tf.reshape( detections_batch, [self.config.BATCH_SIZE, self.config.DETECTION_MAX_INSTANCES, 6]) def compute_output_shape(self, input_shape): return (None, self.config.DETECTION_MAX_INSTANCES, 6)
7. Obtaining the Mask Segmentation Result
In step 6 we obtained the final predicted boxes, which are more accurate than the earlier proposals, so we use them as the cropping regions for the mask model and crop the shared feature layers used by the mask model with these predicted boxes.
After cropping, the mask model classifies every pixel to obtain the segmentation result.
II. Training
The loss function used to train Mask R-CNN consists of several parts: the loss of the proposal (RPN) network, the loss of the classifier network, and the loss of the mask network.
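In Keras implementations along these lines the total loss is usually the sum of five terms; the snippet below only names them as a schematic (the function and argument names here are illustrative, not part of the original code):

def total_loss(rpn_class_loss, rpn_bbox_loss,
               mrcnn_class_loss, mrcnn_bbox_loss, mrcnn_mask_loss):
    # proposal (RPN) network: objectness classification + anchor regression
    # classifier head: class prediction + proposal regression
    # mask head: per-pixel classification inside each positive proposal
    return (rpn_class_loss + rpn_bbox_loss
            + mrcnn_class_loss + mrcnn_bbox_loss
            + mrcnn_mask_loss)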
1. Training the Proposal (RPN) Network
To obtain proposal predictions from the shared feature layers, a 3x3 convolution is applied first, followed by a 1x1 convolution with anchors_per_location x 2 channels and a 1x1 convolution with anchors_per_location x 4 channels.
In Mask R-CNN, anchors_per_location, the number of anchors per grid point, is 3 by default, so the outputs of the two 1x1 convolutions mean the following:
The anchors_per_location x 4 convolution predicts, for every anchor at every grid point of the effective feature layer, how that anchor should be adjusted.
The anchors_per_location x 2 convolution predicts, for every anchor at every grid point of the effective feature layer, whether the corresponding box contains an object.
In other words, the raw predictions of the Mask R-CNN proposal network are not the actual positions of the proposals on the image; they must be decoded to obtain the actual positions.
During training we need to compute a loss with respect to the proposal network's predictions. We feed images into the current proposal network to obtain its predictions, and we also need to encode the ground-truth boxes, i.e. convert the positions of the ground-truth boxes into the format of the proposal network's predictions.
That is, for every ground-truth box in every training image, we need to find its corresponding anchors and work out what the proposal network should predict in order to produce that ground-truth box.
Going from the proposal network's predictions to real boxes is called decoding, and going from real boxes to the proposal network's predictions is the encoding process.
So the encoding process is simply the decoding process run in reverse.
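Inverting the decoding formulas gives the encoding below (a minimal numpy sketch of the same computation that build_rpn_targets performs for each positive anchor in the code that follows):

import numpy as np

def encode_box(anchor, gt_box):
    # anchor and gt_box are (y1, x1, y2, x2); returns (dy, dx, log(dh), log(dw))
    a_h, a_w = anchor[2] - anchor[0], anchor[3] - anchor[1]
    a_cy, a_cx = anchor[0] + 0.5 * a_h, anchor[1] + 0.5 * a_w
    gt_h, gt_w = gt_box[2] - gt_box[0], gt_box[3] - gt_box[1]
    gt_cy, gt_cx = gt_box[0] + 0.5 * gt_h, gt_box[1] + 0.5 * gt_w
    # center offsets are normalized by the anchor size; size ratios are taken in log space
    return np.array([(gt_cy - a_cy) / a_h,
                     (gt_cx - a_cx) / a_w,
                     np.log(gt_h / a_h),
                     np.log(gt_w / a_w)])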
The implementation code is as follows:
def build_rpn_targets(image_shape, anchors, gt_class_ids, gt_boxes, config): # 1代表正樣本 # -1代表負(fù)樣本 # 0代表忽略 rpn_match = np.zeros([anchors.shape[0]], dtype=np.int32) # 創(chuàng)建該部分內(nèi)容利用先驗框和真實框進(jìn)行編碼 rpn_bbox = np.zeros((config.RPN_TRAIN_ANCHORS_PER_IMAGE, 4)) ''' iscrowd=0的時候,表示這是一個單獨的物體,輪廓用Polygon(多邊形的點)表示, iscrowd=1的時候表示兩個沒有分開的物體,輪廓用RLE編碼表示,比如說一張圖片里面有三個人, 一個人單獨站一邊,另外兩個摟在一起(標(biāo)注的時候距離太近分不開了),這個時候, 單獨的那個人的注釋里面的iscrowing=0,segmentation用Polygon表示, 而另外兩個用放在同一個anatation的數(shù)組里面用一個segmention的RLE編碼形式表示 ''' crowd_ix = np.where(gt_class_ids < 0)[0] if crowd_ix.shape[0] > 0: non_crowd_ix = np.where(gt_class_ids > 0)[0] crowd_boxes = gt_boxes[crowd_ix] gt_class_ids = gt_class_ids[non_crowd_ix] gt_boxes = gt_boxes[non_crowd_ix] crowd_overlaps = utils.compute_overlaps(anchors, crowd_boxes) crowd_iou_max = np.amax(crowd_overlaps, axis=1) no_crowd_bool = (crowd_iou_max < 0.001) else: no_crowd_bool = np.ones([anchors.shape[0]], dtype=bool) # 計算先驗框和真實框的重合程度 [num_anchors, num_gt_boxes] overlaps = utils.compute_overlaps(anchors, gt_boxes) # 1. 重合程度小于0.3則代表為負(fù)樣本 anchor_iou_argmax = np.argmax(overlaps, axis=1) anchor_iou_max = overlaps[np.arange(overlaps.shape[0]), anchor_iou_argmax] rpn_match[(anchor_iou_max < 0.3) & (no_crowd_bool)] = -1 # 2. 每個真實框重合度最大的先驗框是正樣本 gt_iou_argmax = np.argwhere(overlaps == np.max(overlaps, axis=0))[:,0] rpn_match[gt_iou_argmax] = 1 # 3. 重合度大于0.7則代表為正樣本 rpn_match[anchor_iou_max >= 0.7] = 1 # 正負(fù)樣本平衡 # 找到正樣本的索引 ids = np.where(rpn_match == 1)[0] # 如果大于(config.RPN_TRAIN_ANCHORS_PER_IMAGE // 2)則刪掉一些 extra = len(ids) - (config.RPN_TRAIN_ANCHORS_PER_IMAGE // 2) if extra > 0: ids = np.random.choice(ids, extra, replace=False) rpn_match[ids] = 0 # 找到負(fù)樣本的索引 ids = np.where(rpn_match == -1)[0] # 使得總數(shù)為config.RPN_TRAIN_ANCHORS_PER_IMAGE extra = len(ids) - (config.RPN_TRAIN_ANCHORS_PER_IMAGE - np.sum(rpn_match == 1)) if extra > 0: # Rest the extra ones to neutral ids = np.random.choice(ids, extra, replace=False) rpn_match[ids] = 0 # 找到內(nèi)部真實存在物體的先驗框,進(jìn)行編碼 ids = np.where(rpn_match == 1)[0] ix = 0 for i, a in zip(ids, anchors[ids]): gt = gt_boxes[anchor_iou_argmax[i]] # 計算真實框的中心,高寬 gt_h = gt[2] - gt[0] gt_w = gt[3] - gt[1] gt_center_y = gt[0] + 0.5 * gt_h gt_center_x = gt[1] + 0.5 * gt_w # 計算先驗框中心,高寬 a_h = a[2] - a[0] a_w = a[3] - a[1] a_center_y = a[0] + 0.5 * a_h a_center_x = a[1] + 0.5 * a_w # 編碼運算 rpn_bbox[ix] = [ (gt_center_y - a_center_y) / a_h, (gt_center_x - a_center_x) / a_w, np.log(gt_h / a_h), np.log(gt_w / a_w), ] # 改變數(shù)量級 rpn_bbox[ix] /= config.RPN_BBOX_STD_DEV ix += 1 return rpn_match, rpn_bbox
With the code above we obtain, for each ground-truth box, all the anchors whose IoU with it is sufficiently large, and we compute the prediction that the network should produce for each of those anchors.
Mask R-CNN ignores anchors whose overlap is fairly high but not high enough; in general, anchors with an IoU between 0.3 and 0.7 are ignored.
Comparing what the proposal network should predict with what it actually predicts gives the loss of the proposal network.
2. Training the Classifier Model
The previous part gave us the loss of the RPN. In the Mask R-CNN model we also need to adjust the proposals to obtain the final predicted boxes; in the classifier model the proposals play the role of anchors.
We therefore compute the overlap between every proposal and the ground-truth boxes and filter them: if a proposal overlaps a ground-truth box with an IoU greater than 0.5 it is treated as a positive sample, and if the overlap is less than 0.5 it is treated as a negative sample.
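A small illustration of that rule with made-up boxes (the actual implementation computes the same overlap in TensorFlow in overlaps_graph below):

import numpy as np

def iou(box_a, box_b):
    # boxes are (y1, x1, y2, x2); returns intersection over union
    y1, x1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    y2, x2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(y2 - y1, 0) * max(x2 - x1, 0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

proposal, gt = (0.10, 0.10, 0.50, 0.50), (0.15, 0.15, 0.55, 0.55)
print(iou(proposal, gt) >= 0.5)   # True, so this proposal would be a positive sample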
We can then encode the ground-truth boxes relative to the proposals, i.e. work out what the classifier model should predict so that these proposals are adjusted into the ground-truth boxes.
The implementation code is as follows:
#----------------------------------------------------------# # Detection Target Layer # 該部分代碼會輸入建議框 # 判斷建議框和真實框的重合情況 # 篩選出內(nèi)部包含物體的建議框 # 利用建議框和真實框編碼 # 調(diào)整mask的格式使得其和預(yù)測格式相同 #----------------------------------------------------------# def overlaps_graph(boxes1, boxes2): """ 用于計算boxes1和boxes2的重合程度 boxes1, boxes2: [N, (y1, x1, y2, x2)]. 返回 [len(boxes1), len(boxes2)] """ b1 = tf.reshape(tf.tile(tf.expand_dims(boxes1, 1), [1, 1, tf.shape(boxes2)[0]]), [-1, 4]) b2 = tf.tile(boxes2, [tf.shape(boxes1)[0], 1]) b1_y1, b1_x1, b1_y2, b1_x2 = tf.split(b1, 4, axis=1) b2_y1, b2_x1, b2_y2, b2_x2 = tf.split(b2, 4, axis=1) y1 = tf.maximum(b1_y1, b2_y1) x1 = tf.maximum(b1_x1, b2_x1) y2 = tf.minimum(b1_y2, b2_y2) x2 = tf.minimum(b1_x2, b2_x2) intersection = tf.maximum(x2 - x1, 0) * tf.maximum(y2 - y1, 0) b1_area = (b1_y2 - b1_y1) * (b1_x2 - b1_x1) b2_area = (b2_y2 - b2_y1) * (b2_x2 - b2_x1) union = b1_area + b2_area - intersection iou = intersection / union overlaps = tf.reshape(iou, [tf.shape(boxes1)[0], tf.shape(boxes2)[0]]) return overlaps def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config): asserts = [ tf.Assert(tf.greater(tf.shape(proposals)[0], 0), [proposals], name="roi_assertion"), ] with tf.control_dependencies(asserts): proposals = tf.identity(proposals) # 移除之前獲得的padding的部分 proposals, _ = trim_zeros_graph(proposals, name="trim_proposals") gt_boxes, non_zeros = trim_zeros_graph(gt_boxes, name="trim_gt_boxes") gt_class_ids = tf.boolean_mask(gt_class_ids, non_zeros, name="trim_gt_class_ids") gt_masks = tf.gather(gt_masks, tf.where(non_zeros)[:, 0], axis=2, name="trim_gt_masks") # Handle COCO crowds # A crowd box in COCO is a bounding box around several instances. Exclude # them from training. A crowd box is given a negative class ID. crowd_ix = tf.where(gt_class_ids < 0)[:, 0] non_crowd_ix = tf.where(gt_class_ids > 0)[:, 0] crowd_boxes = tf.gather(gt_boxes, crowd_ix) gt_class_ids = tf.gather(gt_class_ids, non_crowd_ix) gt_boxes = tf.gather(gt_boxes, non_crowd_ix) gt_masks = tf.gather(gt_masks, non_crowd_ix, axis=2) # 計算建議框和所有真實框的重合程度 [proposals, gt_boxes] overlaps = overlaps_graph(proposals, gt_boxes) # 計算和 crowd boxes 的重合程度 [proposals, crowd_boxes] crowd_overlaps = overlaps_graph(proposals, crowd_boxes) crowd_iou_max = tf.reduce_max(crowd_overlaps, axis=1) no_crowd_bool = (crowd_iou_max < 0.001) # Determine positive and negative ROIs roi_iou_max = tf.reduce_max(overlaps, axis=1) # 1. 正樣本建議框和真實框的重合程度大于0.5 positive_roi_bool = (roi_iou_max >= 0.5) positive_indices = tf.where(positive_roi_bool)[:, 0] # 2. 負(fù)樣本建議框和真實框的重合程度小于0.5,Skip crowds. negative_indices = tf.where(tf.logical_and(roi_iou_max < 0.5, no_crowd_bool))[:, 0] # Subsample ROIs. 
Aim for 33% positive # 進(jìn)行正負(fù)樣本的平衡 # 取出最大33%的正樣本 positive_count = int(config.TRAIN_ROIS_PER_IMAGE * config.ROI_POSITIVE_RATIO) positive_indices = tf.random_shuffle(positive_indices)[:positive_count] positive_count = tf.shape(positive_indices)[0] # 保持正負(fù)樣本比例 r = 1.0 / config.ROI_POSITIVE_RATIO negative_count = tf.cast(r * tf.cast(positive_count, tf.float32), tf.int32) - positive_count negative_indices = tf.random_shuffle(negative_indices)[:negative_count] # 獲得正樣本和負(fù)樣本 positive_rois = tf.gather(proposals, positive_indices) negative_rois = tf.gather(proposals, negative_indices) # 獲取建議框和真實框重合程度 positive_overlaps = tf.gather(overlaps, positive_indices) # 判斷是否有真實框 roi_gt_box_assignment = tf.cond( tf.greater(tf.shape(positive_overlaps)[1], 0), true_fn = lambda: tf.argmax(positive_overlaps, axis=1), false_fn = lambda: tf.cast(tf.constant([]),tf.int64) ) # 找到每一個建議框?qū)?yīng)的真實框和種類 roi_gt_boxes = tf.gather(gt_boxes, roi_gt_box_assignment) roi_gt_class_ids = tf.gather(gt_class_ids, roi_gt_box_assignment) # 解碼獲得網(wǎng)絡(luò)應(yīng)該有得預(yù)測結(jié)果 deltas = utils.box_refinement_graph(positive_rois, roi_gt_boxes) deltas /= config.BBOX_STD_DEV # 切換mask的形式[N, height, width, 1] transposed_masks = tf.expand_dims(tf.transpose(gt_masks, [2, 0, 1]), -1) # 取出對應(yīng)的層 roi_masks = tf.gather(transposed_masks, roi_gt_box_assignment) # Compute mask targets boxes = positive_rois if config.USE_MINI_MASK: # Transform ROI coordinates from normalized image space # to normalized mini-mask space. y1, x1, y2, x2 = tf.split(positive_rois, 4, axis=1) gt_y1, gt_x1, gt_y2, gt_x2 = tf.split(roi_gt_boxes, 4, axis=1) gt_h = gt_y2 - gt_y1 gt_w = gt_x2 - gt_x1 y1 = (y1 - gt_y1) / gt_h x1 = (x1 - gt_x1) / gt_w y2 = (y2 - gt_y1) / gt_h x2 = (x2 - gt_x1) / gt_w boxes = tf.concat([y1, x1, y2, x2], 1) box_ids = tf.range(0, tf.shape(roi_masks)[0]) masks = tf.image.crop_and_resize(tf.cast(roi_masks, tf.float32), boxes, box_ids, config.MASK_SHAPE) # Remove the extra dimension from masks. 
masks = tf.squeeze(masks, axis=3) # 防止resize后的結(jié)果不是1或者0 masks = tf.round(masks) # 一般傳入config.TRAIN_ROIS_PER_IMAGE個建議框進(jìn)行訓(xùn)練, # 如果數(shù)量不夠則padding rois = tf.concat([positive_rois, negative_rois], axis=0) N = tf.shape(negative_rois)[0] P = tf.maximum(config.TRAIN_ROIS_PER_IMAGE - tf.shape(rois)[0], 0) rois = tf.pad(rois, [(0, P), (0, 0)]) roi_gt_boxes = tf.pad(roi_gt_boxes, [(0, N + P), (0, 0)]) roi_gt_class_ids = tf.pad(roi_gt_class_ids, [(0, N + P)]) deltas = tf.pad(deltas, [(0, N + P), (0, 0)]) masks = tf.pad(masks, [[0, N + P], (0, 0), (0, 0)]) return rois, roi_gt_class_ids, deltas, masks def trim_zeros_graph(boxes, name='trim_zeros'): """ 如果前一步?jīng)]有滿POST_NMS_ROIS_TRAINING個建議框,會有padding 要去掉padding """ non_zeros = tf.cast(tf.reduce_sum(tf.abs(boxes), axis=1), tf.bool) boxes = tf.boolean_mask(boxes, non_zeros, name=name) return boxes, non_zeros class DetectionTargetLayer(Layer): """找到建議框的ground_truth Inputs: proposals: [batch, N, (y1, x1, y2, x2)]建議框 gt_class_ids: [batch, MAX_GT_INSTANCES]每個真實框?qū)?yīng)的類 gt_boxes: [batch, MAX_GT_INSTANCES, (y1, x1, y2, x2)]真實框的位置 gt_masks: [batch, height, width, MAX_GT_INSTANCES]真實框的語義分割情況 Returns: rois: [batch, TRAIN_ROIS_PER_IMAGE, (y1, x1, y2, x2)]內(nèi)部真實存在目標(biāo)的建議框 target_class_ids: [batch, TRAIN_ROIS_PER_IMAGE]每個建議框?qū)?yīng)的類 target_deltas: [batch, TRAIN_ROIS_PER_IMAGE, (dy, dx, log(dh), log(dw)]每個建議框應(yīng)該有的調(diào)整參數(shù) target_mask: [batch, TRAIN_ROIS_PER_IMAGE, height, width]每個建議框語義分割情況 """ def __init__(self, config, **kwargs): super(DetectionTargetLayer, self).__init__(**kwargs) self.config = config def call(self, inputs): proposals = inputs[0] gt_class_ids = inputs[1] gt_boxes = inputs[2] gt_masks = inputs[3] # 對真實框進(jìn)行編碼 names = ["rois", "target_class_ids", "target_bbox", "target_mask"] outputs = utils.batch_slice( [proposals, gt_class_ids, gt_boxes, gt_masks], lambda w, x, y, z: detection_targets_graph( w, x, y, z, self.config), self.config.IMAGES_PER_GPU, names=names) return outputs def compute_output_shape(self, input_shape): return [ (None, self.config.TRAIN_ROIS_PER_IMAGE, 4), # rois (None, self.config.TRAIN_ROIS_PER_IMAGE), # class_ids (None, self.config.TRAIN_ROIS_PER_IMAGE, 4), # deltas (None, self.config.TRAIN_ROIS_PER_IMAGE, self.config.MASK_SHAPE[0], self.config.MASK_SHAPE[1]) # masks ] def compute_mask(self, inputs, mask=None): return [None, None, None, None]
3. Training the Mask Model
When training the mask model, note that the regions the proposal network crops out of the shared feature layers used by the mask model do not coincide with the regions the ground-truth boxes would crop. We therefore also need to compute the position of the cropping box relative to the ground-truth box in order to extract the correct segmentation target.
The code below is used; the large middle section computes the position of the proposal relative to the ground-truth box. With this relative position, the segmentation mask can be cropped correctly to obtain the right segmentation target.
# Compute mask targets
boxes = positive_rois
if config.USE_MINI_MASK:
    # Transform ROI coordinates from normalized image space
    # to normalized mini-mask space.
    y1, x1, y2, x2 = tf.split(positive_rois, 4, axis=1)
    gt_y1, gt_x1, gt_y2, gt_x2 = tf.split(roi_gt_boxes, 4, axis=1)
    gt_h = gt_y2 - gt_y1
    gt_w = gt_x2 - gt_x1
    y1 = (y1 - gt_y1) / gt_h
    x1 = (x1 - gt_x1) / gt_w
    y2 = (y2 - gt_y1) / gt_h
    x2 = (x2 - gt_x1) / gt_w
    boxes = tf.concat([y1, x1, y2, x2], 1)
box_ids = tf.range(0, tf.shape(roi_masks)[0])
masks = tf.image.crop_and_resize(tf.cast(roi_masks, tf.float32), boxes,
                                 box_ids, config.MASK_SHAPE)
The masks obtained in this way can then be combined with the model's predictions to train the model.
Training Your Own Mask R-CNN Model
The overall folder structure of the Mask R-CNN project is as follows:
1. Preparing the Dataset
This article is intended for readers who want to train on their own dataset. First, annotate the data with labelme.
Put the annotated files in the before folder:
I wrote a conversion script from labelme annotations to the dataset format; run it from the directory that contains the before folder.
After running it, a train_dataset folder is generated; place this train_dataset folder in the root directory of the Mask R-CNN project.
The conversion script is as follows:
import argparse import json import os import os.path as osp import warnings import PIL.Image import yaml from labelme import utils import base64 def main(): count = os.listdir("./before/") index = 0 for i in range(0, len(count)): path = os.path.join("./before", count[i]) if os.path.isfile(path) and path.endswith('json'): data = json.load(open(path)) if data['imageData']: imageData = data['imageData'] else: imagePath = os.path.join(os.path.dirname(path), data['imagePath']) with open(imagePath, 'rb') as f: imageData = f.read() imageData = base64.b64encode(imageData).decode('utf-8') img = utils.img_b64_to_arr(imageData) label_name_to_value = {'_background_': 0} for shape in data['shapes']: label_name = shape['label'] if label_name in label_name_to_value: label_value = label_name_to_value[label_name] else: label_value = len(label_name_to_value) label_name_to_value[label_name] = label_value # label_values must be dense label_values, label_names = [], [] for ln, lv in sorted(label_name_to_value.items(), key=lambda x: x[1]): label_values.append(lv) label_names.append(ln) assert label_values == list(range(len(label_values))) lbl = utils.shapes_to_label(img.shape, data['shapes'], label_name_to_value) captions = ['{}: {}'.format(lv, ln) for ln, lv in label_name_to_value.items()] lbl_viz = utils.draw_label(lbl, img, captions) if not os.path.exists("train_dataset"): os.mkdir("train_dataset") label_path = "train_dataset/mask" if not os.path.exists(label_path): os.mkdir(label_path) img_path = "train_dataset/imgs" if not os.path.exists(img_path): os.mkdir(img_path) yaml_path = "train_dataset/yaml" if not os.path.exists(yaml_path): os.mkdir(yaml_path) label_viz_path = "train_dataset/label_viz" if not os.path.exists(label_viz_path): os.mkdir(label_viz_path) PIL.Image.fromarray(img).save(osp.join(img_path, str(index)+'.jpg')) utils.lblsave(osp.join(label_path, str(index)+'.png'), lbl) PIL.Image.fromarray(lbl_viz).save(osp.join(label_viz_path, str(index)+'.png')) warnings.warn('info.yaml is being replaced by label_names.txt') info = dict(label_names=label_names) with open(osp.join(yaml_path, str(index)+'.yaml'), 'w') as f: yaml.safe_dump(info, f, default_flow_style=False) index = index+1 print('Saved : %s' % str(index)) if __name__ == '__main__': main()
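Based on the script above, the generated train_dataset folder has the following layout:

train_dataset/
    imgs/        the original images, saved as 0.jpg, 1.jpg, ...
    mask/        the label PNGs written by utils.lblsave
    yaml/        the label_names information for each image (0.yaml, 1.yaml, ...)
    label_viz/   visualizations of the labels drawn on top of the images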
2. Modifying the Parameters
Once the dataset has been generated, modify the parameters in train.py as required and you can start training. NUM_CLASSES is the total number of classes plus 1 (for the background).
In dataset.py, change the classes to your own: in the load_shapes and load_mask functions, replace the class-related content (the original circle, square, and so on) with the classes you want to segment.
In train.py, modify the contents of ShapesConfig(Config); NUM_CLASSES should equal the number of your classes plus 1.
Adjust IMAGE_MAX_DIM, IMAGE_MIN_DIM, BATCH_SIZE, and IMAGES_PER_GPU according to your GPU memory. Adjust RPN_ANCHOR_SCALES according to IMAGE_MAX_DIM and IMAGE_MIN_DIM.
STEPS_PER_EPOCH is the number of training steps per epoch.
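A hedged example of what such a configuration might look like, assuming a dataset with 2 foreground classes; the attribute names follow the ShapesConfig(Config) pattern in train.py, and the specific values are only illustrative:

class ShapesConfig(Config):
    NAME = "shapes"
    # 1 background class + your own classes (2 in this example)
    NUM_CLASSES = 1 + 2
    # reduce these if you run out of GPU memory
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    BATCH_SIZE = 1
    IMAGE_MIN_DIM = 512
    IMAGE_MAX_DIM = 512
    # anchor scales should roughly match object sizes at this input resolution
    RPN_ANCHOR_SCALES = (16, 32, 64, 128, 256)
    # number of training steps per epoch
    STEPS_PER_EPOCH = 100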
3. Training the Model
Once all the modifications are done, run train.py to start training.
This concludes the detailed walkthrough of building a Mask R-CNN instance segmentation platform with Keras; for more on Mask R-CNN instance segmentation with Keras, see the other related articles on 脚本之家.