Deep learning based robot target recognition and motion detection method, storage medium and apparatus
Saved in:
| Titel: | Deep learning based robot target recognition and motion detection method, storage medium and apparatus |
|---|---|
| Patent Number: | 11,763,485 |
| Publication Date: | September 19, 2023 |
| Appl. No: | 18/110544 |
| Application Filed: | February 16, 2023 |
| Abstract: | The present invention discloses a deep learning based robot target recognition and motion detection method, storage medium and device. The method consists of the following steps: Step S1, adding masks to regions where potentially dynamic objects are located through an instance segmentation network incorporating attention mechanisms and positional coding; Step S2, estimating the camera pose using static feature points outside the instance segmentation mask in the scene; Step S3, estimating the object pose transformation matrix from the camera pose; Step S4, determining the state of motion of the object's characteristic points from the relationship between motion parallax and differential entropy, and thus the state of motion of the object as a whole; Step S5, rejecting the dynamic objects and repairing the static background of the rejected area for pose estimation and map construction. The invention improves the accuracy of the segmented boundaries of occluded dynamic objects, and the rejection of dynamic-region feature points reduces the impact of dynamic objects on the system. |
| Inventors: | Anhui University of Engineering (Wuhu, CN) |
| Assignees: | ANHUI UNIVERSITY OF ENGINEERING (Wuhu, CN) |
| Claim: | 1. A deep learning based method for robot target recognition and motion detection, comprising the steps of: adding masks to regions where potentially dynamic objects are located through instance segmentation networks incorporating attention mechanisms and positional coding; estimating a camera pose using static feature points outside an instance segmentation mask in the scene; estimating an object pose transformation matrix from the camera pose, as the camera pose and the object pose projection are coupled in the same image; finding a median motion parallax of all points on a potentially moving object, obtaining a differential entropy of motion uncertainty from the pose optimization process, and determining the motion state of the object's characteristic points from the relationship between the median motion parallax and the differential entropy, so as to determine the motion state of the object as a whole; eliminating the dynamic objects, repairing the static background of the eliminated area, and filtering high-quality feature points according to information entropy and cross-entropy for pose estimation and map construction;
wherein in the step of adding the masks, a multi-attention module consists of two sub-networks, a channel attention mechanism and a spatial attention mechanism, which process an input feature map F in the channel dimension and the spatial dimension respectively, and concatenate the resulting channel-dimension feature map F′ with the spatial-dimension feature map F″ to obtain an output F′″, in order to enhance the pixel weights of an occluded object part and improve the recognition rate of the occluded object; the channel attention mechanism works by assigning a weight to each channel layer of the feature map, while the spatial attention mechanism works by increasing the weight of pixel values at occluded locations in the feature map, continuously adjusting each weight value during learning and directing the network to focus on the area where the occluded part is located, thus adding a mask to the area where the potentially dynamic objects are located (sketched after this claim);
wherein in the step of adding the masks, the H×W×C feature map F is input to the channel attention mechanism and subjected to average-pooling and max-pooling operations to obtain information about each channel of the feature map; the features $F_{avg}$ and $F_{max}$ obtained through average-pooling and max-pooling are passed through a fully connected (FC) layer module to strengthen the correlation between channels and to reallocate the weights of the channel layers for better learning of occlusion features; the output $f_v$ of the channel attention mechanism is calculated as $f_v = \sigma((F_{avg} + F_{max})\eta\beta)$, where $\sigma$ denotes a Sigmoid function, $\eta$ denotes a ReLU function, and $\beta$ is a parameter of the fully connected layer; finally, the channel-dimension feature map F′ is obtained by layer-by-layer channel weighting of the input feature map F using $f_v$, with H, W and C denoting height, width and number of channels respectively;
wherein in the step of adding the masks, the input feature map F is also fed into the spatial attention mechanism, where the average-pooling and max-pooling results are fused by concatenation into an H×W×2 feature map $f_c$, which is then processed by a 3×3×1 convolution layer and the Sigmoid function to obtain a spatial attention map $f_u$, calculated as $f_u = \sigma(c(f_c))$, where $f_u$ is the spatial attention map, $f_c$ is the H×W×2 feature map, $\sigma$ denotes the Sigmoid function, and $c$ is a 3×3×1 convolutional network; $f_u$ is applied to the input feature map F to obtain the spatial-dimension feature map F″ weighted by spatial attention;
wherein a relative position encoding algorithm is used in the step of adding the masks, which uses a dot product to calculate a correlation score $e_{ij}$ between input elements, calculated as [mathematical expression included], where $e_{ij}$ is the correlation score between the input elements, $\sigma$ is a trainable parameter with an initial value of 1, $\rho$ is a two-dimensional relative position weight that interacts with the query parameters in a transformer network, $W_Q$ and $W_K$ are trainable parameter matrices, $P_i$ and $P_j$ are input image blocks, $i$ and $j$ are input image block numbers, and $d_z$ denotes the output matrix dimension; the relative position coding is incorporated into the transformer network to build a fused relative-position-coding transformer module, which enhances the boundary semantic information between occluding and occluded objects by reassigning pixel weights according to inter-pixel distances, so as to improve the accuracy of the segmented boundaries of occluded dynamic objects (sketched after this claim);
wherein in the step of estimating the camera pose, the robot in real-time operation, with known camera calibration parameters and feature point depths, associates a static point $m$ in space from a reference frame $F_{k-1}$ to a subsequent frame $F_k$, calculated as $m_k = \Delta[H_c \Delta^{-1} I_{k-1}(m_{k-1}, d_{k-1})]$, where $\Delta$ and $\Delta^{-1}$ are the projection function and the inverse projection function respectively, composed of the camera's internal and external parameters; $H_c \in SE(3)$ is the relative transformation matrix of the camera pose, $SE(3)$ being a Lie group matrix; $I_{k-1}$ is the projection of the static point in space onto a 3D point in $F_{k-1}$ with coordinates $(m_{k-1}, d_{k-1})$, where $m_{k-1}$ is the 2D pixel coordinate of the point in frame $F_{k-1}$ and $d_{k-1}$ is the depth of the point in frame $F_{k-1}$; $m_k$ is the 2D pixel coordinate of the spatially static point projected into $F_k$;
wherein the camera pose is obtained by calculating the reprojection error $e(H_c) = m_k' - \Delta[I_{k-1}(m_{k-1}, d_{k-1})\,\Delta\, H_c \exp(h_c)]$, where $e(H_c)$ is the reprojection error of $H_c$, $H_c \in SE(3)$ is the relative transformation matrix of the camera pose, $h_c \in se(3)$ is the relative transformation vector of the camera pose obtained from the $H_c$ transformation, $I_{k-1}$ is the projection of the object feature point onto a 3D point in $F_{k-1}$, $m_{k-1}$ is the 2D pixel coordinate in frame $F_{k-1}$, $d_{k-1}$ is the depth of the point in frame $F_{k-1}$, $m_k'$ is the 2D pixel coordinate $m_{k-1}$ of the previous frame $F_{k-1}$ projected onto the current frame, $\Delta$ and $\Delta^{-1}$ are the projection and inverse projection functions respectively, and $\exp(\cdot)$ is the transformation from a Lie algebra vector to a Lie group matrix; defining $\tilde{h}_c$ as a symbolic operation mapping from $se(3)$, the least squares solution $\tilde{h}_c^*$ is given by [mathematical expression included], where $\rho_h$ is a penalty factor, $\Sigma_p$ is the covariance matrix of the reprojection error, $n$ is the number of 3D points projected to 2D points for the residual operation, and $e(h_c)$ is the reprojection error of $h_c$; the relative transformation matrix $H_c$ of the camera pose is obtained by solving for the $h_c$ transformation, whereby the camera pose is obtained by optimization (sketched after this claim);
wherein in the step of estimating the object pose transformation matrix, the object pose transformation matrix is estimated from the camera motion $H_c \in SE(3)$; a potentially dynamic object is modelled as an entity with a pose transformation matrix $H_o$, and dynamic points in space are associated from the reference frame $F_{k-1}$ to the next frame $F_k$, calculated as $\tilde{m}_k = \Delta[H_c H_o \Delta^{-1} I_{k-1}'(\tilde{m}_{k-1}, \tilde{d}_{k-1})]$, where $H_c \in SE(3)$ is the camera-motion pose transformation matrix, $H_o \in SE(3)$ is the relative transformation matrix of the object pose, $I_{k-1}'$ is the projection of the dynamic point $\tilde{m}$ in space onto a 3D point in frame $F_{k-1}$, $\tilde{m}_{k-1}$ is the 2D pixel coordinate in depth image frame $F_{k-1}$, $\tilde{d}_{k-1}$ is the depth of the coordinate point in frame $F_{k-1}$, $\tilde{m}_k$ is the 2D coordinate of the point $\tilde{m}$ in frame $F_k$, and $\Delta$ and $\Delta^{-1}$ are the projection and inverse projection functions composed of the camera internal and external parameters; the object pose transformation matrix $H_o$ is obtained by reprojection error and least squares calculation as $e(H_o) = \tilde{m}_k' - \Delta[H_c H_o \Delta^{-1} I_{k-1}'(\tilde{m}_{k-1}, \tilde{d}_{k-1})]$, where $e(H_o)$ is the reprojection error, $h_o \in se(3)$ is the relative transformation vector of the object pose obtained from the $H_o$ transformation, $n_b$ is the number of 3D points projected to 2D points for the corresponding residual operation, $\tilde{m}_k'$ is the 2D pixel coordinate $\tilde{m}_{k-1}$ of the previous frame $F_{k-1}$ projected onto the current frame, and $\exp(\cdot)$ is the transformation from a Lie algebra vector to a Lie group matrix; the object transformation matrix is derived by minimising the error value;
wherein in the step of finding the median motion parallax, a two-dimensional image measurement is used to determine the state of the object; assuming that the feature point $\tilde{m}_{k-1}$ is a static projection point, the pixel distance $d$ between the static projection point and its true projection point $\tilde{m}_k$ is the dynamic visual error, and the median $L$ of the dynamic visual errors $d$ of the pixel points on the potentially dynamic object in the image is calculated and taken as the dynamic visual error of the object, with $L = \mathrm{med}\{d\} = \mathrm{med}\{\|\tilde{m}_{k-1}', \tilde{m}_k\|\}$;
wherein in a nonlinear pose optimization phase, the uncertainty error is set to satisfy a K-dimensional Gaussian distribution, and its differential entropy is calculated as $G(x_0) = \log_2 w\sqrt{\Pi^{-1}\Sigma_r(\Pi^{-1})^T(2\pi e)^{\mu}}$, where $G(x_0)$ is the differential entropy, $x_0$ is the input quantity, $w$ is the probability of movement obtained from propagation of the previous frame, $\Pi$ is the derivative of the residual equation, $\Sigma_r$ is the covariance matrix, $r$ is the photometric reprojection error, and $\mu$ denotes the dimension of the K-dimensional Gaussian distribution; the object dynamic deviation is compared with a dynamic threshold $\Delta d = H(G(x))$, which is guided by the differential entropy and increases slowly with it, $H(G(x))$ being a function constructed for this purpose; if $L > \Delta d$, the object is determined to be a dynamic object (sketched after this claim). |
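A minimal PyTorch sketch of the multi-attention module described in claim 1, assuming a CBAM-style layout: channel attention from average- and max-pooled descriptors passed through a shared fully connected module, spatial attention from a 3×3 convolution over the concatenated pooled maps, and concatenation of the two weighted maps into F′″. The class names, the reduction ratio and the shared-FC layout are illustrative assumptions, not the patent's reference implementation.

```python
# Sketch of the claim's multi-attention module (channel + spatial attention).
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # FC module shared by the avg- and max-pooled descriptors,
        # in the spirit of f_v = sigmoid((F_avg + F_max) ReLU beta).
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        f_avg = x.mean(dim=(2, 3))                   # average pooling -> B x C
        f_max = x.amax(dim=(2, 3))                   # max pooling     -> B x C
        f_v = torch.sigmoid(self.fc(f_avg + f_max))  # per-channel weights
        return x * f_v.view(b, c, 1, 1)              # channel-weighted F'

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        # 3x3 convolution mapping the H x W x 2 pooled map to a 1-channel map.
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_c = torch.cat([x.mean(dim=1, keepdim=True),
                         x.amax(dim=1, keepdim=True)], dim=1)  # H x W x 2 map
        f_u = torch.sigmoid(self.conv(f_c))                    # spatial attention map
        return x * f_u                                         # spatially weighted F''

class MultiAttention(nn.Module):
    """Concatenate the channel-weighted and spatially weighted maps to form F'''."""
    def __init__(self, channels: int):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.channel(f), self.spatial(f)], dim=1)
```

Under this concatenation choice, an input of shape (B, C, H, W) produces an output of shape (B, 2C, H, W).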
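A hypothetical sketch of the relative-position-coded correlation score. The claim's exact expression is elided in the record ("[mathematical expression included]"), so this follows a common dot-product formulation with a trainable scale σ on a two-dimensional relative-position term; the function and argument names are assumptions.

```python
# Assumed form: e_ij = (P_i W_Q)(P_j W_K)^T / sqrt(d_z) + sigma * rho_ij,
# where rho is a 2D relative-position weight table. Not the patent's exact equation.
import numpy as np

def correlation_scores(P, W_Q, W_K, rel_pos_bias, sigma=1.0):
    """P: (n, d) image-block embeddings; rel_pos_bias: (n, n) relative-position weights rho."""
    d_z = W_K.shape[1]
    q = P @ W_Q                       # queries
    k = P @ W_K                       # keys
    e = (q @ k.T) / np.sqrt(d_z)      # dot-product correlation between input elements
    return e + sigma * rel_pos_bias   # interaction with the relative-position weights
```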
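A minimal numpy sketch of the reprojection residuals behind the camera-pose and object-pose steps: a point observed with depth in frame F_{k−1} is back-projected, transformed by the candidate SE(3) pose(s), and re-projected into frame F_k. The least-squares solver over se(3) and the penalty factor ρ_h are omitted; the intrinsic matrix K, the 4×4 homogeneous pose matrices and the function names are illustrative assumptions.

```python
import numpy as np

def back_project(m, d, K):
    """Inverse projection: pixel (u, v) with depth d -> 3D point in the camera frame."""
    u, v = m
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.array([(u - cx) * d / fx, (v - cy) * d / fy, d])

def project(X, K):
    """Projection: 3D point -> 2D pixel coordinates."""
    x = K @ X
    return x[:2] / x[2]

def camera_residual(H_c, m_prev, d_prev, m_curr, K):
    """e(H_c): residual of a static point warped by the relative camera pose H_c (4x4)."""
    X_prev = back_project(m_prev, d_prev, K)
    X_curr = H_c[:3, :3] @ X_prev + H_c[:3, 3]
    return np.asarray(m_curr) - project(X_curr, K)

def object_residual(H_c, H_o, m_prev, d_prev, m_curr, K):
    """e(H_o): residual of a dynamic point warped by H_c * H_o, as in the claim."""
    H = H_c @ H_o
    X_prev = back_project(m_prev, d_prev, K)
    X_curr = H[:3, :3] @ X_prev + H[:3, 3]
    return np.asarray(m_curr) - project(X_curr, K)
```

Summing robustly weighted squared residuals over the n static points outside the mask (or the n_b points on a potentially dynamic object) and minimizing over the corresponding se(3) vector would yield H_c and H_o, mirroring the claim's least-squares formulation.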
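A sketch of the dynamic/static decision: the median L of the dynamic visual errors is compared with a threshold Δd = H(G(x)) that grows slowly with the differential entropy of the pose-optimization uncertainty. The affine form a + b·entropy used for H(·) below is a hypothetical choice; the claim only requires a slowly increasing function of the differential entropy.

```python
import numpy as np

def dynamic_visual_error_median(pred_static_px, observed_px):
    """L = med{ || m~'_{k-1} - m~_k || } over points on the potentially dynamic object."""
    d = np.linalg.norm(np.asarray(pred_static_px) - np.asarray(observed_px), axis=1)
    return np.median(d)

def gaussian_differential_entropy(cov):
    """Differential entropy (in bits) of a K-dimensional Gaussian with covariance cov."""
    k = cov.shape[0]
    return 0.5 * np.log2(((2 * np.pi * np.e) ** k) * np.linalg.det(cov))

def is_dynamic(pred_static_px, observed_px, cov, a=1.0, b=0.1):
    """Object is dynamic if L exceeds the entropy-guided threshold Δd = H(G(x))."""
    L = dynamic_visual_error_median(pred_static_px, observed_px)
    delta_d = a + b * gaussian_differential_entropy(cov)   # assumed form of H(.)
    return L > delta_d
```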
| Claim: | 2. The deep learning based method for robot target recognition and motion detection according to claim 1, wherein said step of eliminating the dynamic objects starts with a keyframe $F_t$ to be repaired; keyframe images are aligned in order with the keyframe image to be repaired according to a grid flow between the two frames, and when all keyframe images have been aligned with the keyframe image to be repaired, each pixel in the missing area of the keyframe image to be repaired is indexed forward to its corresponding pixels; if exactly one corresponding pixel is indexed, the missing-area pixel is filled in directly, and if more than one corresponding pixel value is indexed, the indexed pixel values are averaged and the missing-area pixel is then filled in with the average (sketched below). |
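A sketch of the pixel-filling rule in claim 2, assuming the other keyframes have already been warped onto the keyframe to be repaired by the grid flow (the alignment step itself is omitted): each missing pixel is filled directly when a single aligned frame supplies it, and with the average when several do. Function and array names are illustrative.

```python
import numpy as np

def fill_missing_pixels(target, mask, aligned_frames, aligned_valid):
    """
    target:         (H, W, 3) keyframe with a dynamic-object hole
    mask:           (H, W) bool, True where pixels are missing
    aligned_frames: list of (H, W, 3) keyframes already warped onto the target
    aligned_valid:  list of (H, W) bool maps, True where the warped frame has data
    """
    repaired = target.astype(np.float64).copy()
    acc = np.zeros_like(repaired)
    count = np.zeros(mask.shape, dtype=np.int32)
    for frame, valid in zip(aligned_frames, aligned_valid):
        use = mask & valid                 # missing pixels this frame can supply
        acc[use] += frame[use]
        count[use] += 1
    filled = mask & (count > 0)
    # One source fills directly; several sources are averaged before filling.
    repaired[filled] = acc[filled] / count[filled, None]
    return repaired.astype(target.dtype)
```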
| Claim: | 3. A non-transitory computer readable storage medium having a computer program stored thereon, wherein said computer program when executed by a processor implements the steps of the deep learning based robot target recognition and motion detection method as claimed in claim 1 . |
| Claim: | 4. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein said processor implements the steps of the deep learning based robot target recognition and motion detection method as claimed in claim 1 when executing said computer program. |
| Patent References Cited: | 20180189573, July 2018, Divakaran; 20200285247, September 2020, Tan; 20210019897, January 2021, Biswas et al.; 20220161422, May 2022, Chen; 20230092774, March 2023, Velardo; 112132897, December 2020; 112991447, June 2021 |
| Other References: | Hu et al., “Dynamic Object Segmentation Based on Mark R-CNN Apply in RGB-D SLAM,” Industrial Control Computer Issue 3, 2020, pp. 15-17, 3 pages. cited by applicant |
| Primary Examiner: | Goradia, Shefali D |
| Attorney, Agent or Firm: | MUNCY, GEISSLER, OLDS & LOWE, P.C. |
| Document Code: | edspgr.11763485 |
| Database: | USPTO Patent Grants |