Scalable three-dimensional object recognition in a cross reality system

Bibliographic Details
Title: Scalable three-dimensional object recognition in a cross reality system
Patent Number: 11,257,300
Publication Date: February 22, 2022
Appl. No: 16/899,878
Application Filed: June 12, 2020
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for scalable three-dimensional (3-D) object recognition in a cross reality system. One of the methods includes maintaining object data specifying objects that have been recognized in a scene. A stream of input images of the scene is received, including a stream of color images and a stream of depth images. A color image is provided as input to an object recognition system. A recognition output that identifies a respective object mask for each object in the color image is received. A synchronization system determines a corresponding depth image for the color image. A 3-D bounding box generation system determines a respective 3-D bounding box for each object that has been recognized in the color image. Data specifying one or more 3-D bounding boxes is received as output from the 3-D bounding box generation system.
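For orientation, below is a minimal sketch of the per-frame data flow the abstract describes, with the three subsystems passed in as callables. The function names and signatures are illustrative assumptions, not the patented implementation.

```python
def process_color_stream(color_stream, depth_stream, object_data,
                         recognize_masks, find_corresponding_depth, generate_3d_boxes):
    """Illustrative per-frame loop: 2-D recognition, depth synchronization, 3-D box generation.

    The three callables are hypothetical stand-ins for the object recognition system,
    the synchronization system, and the 3-D bounding box generation system.
    """
    for color_image in color_stream:
        # Recognition output: one 2-D object mask per object recognized in the color image.
        recognition_output = recognize_masks(color_image)
        # Pick the depth image whose timestamp matches the color image's timestamp.
        depth_image = find_corresponding_depth(color_image, depth_stream)
        if depth_image is None:
            continue  # no depth image close enough in time; skip this color frame
        # Produce 3-D bounding boxes and update the maintained object data.
        boxes_3d, object_data = generate_3d_boxes(object_data, recognition_output, depth_image)
        yield boxes_3d
```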
Inventors: Magic Leap, Inc. (Plantation, FL, US)
Assignees: Magic Leap, Inc. (Plantation, FL, US)
Claim: 1. A computer-implemented method, the method comprising: maintaining object data specifying objects that have been recognized in a scene in an environment; receiving a stream of input images of the scene, wherein the stream of input images comprises a stream of color images and a stream of depth images; for each of a plurality of color images in the stream of color images: providing the color image as input to an object recognition system; receiving, as output from the object recognition system, a recognition output that identifies a respective object mask in the color image for each of one or more objects that have been recognized in the color image; providing the color image and a plurality of depth images in the stream of depth images as input to a synchronization system that determines a corresponding depth image for the color image based on a timestamp of the corresponding depth image and a timestamp of the color image; providing the object data, the recognition output identifying the object masks, and the corresponding depth image as input to a three-dimensional (3-D) bounding box generation system that determines, from the object data, the object masks, and the corresponding depth image, a respective 3-D bounding box for each of one or more of the objects that have been recognized in the color image; and receiving, as output from the 3-D bounding box generation system, data specifying one or more 3-D bounding boxes for one or more of the objects recognized in the color image; and providing, as output, data specifying the one or more 3-D bounding boxes.
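One plausible shape for the maintained object data recited in claim 1 is sketched below; the field names are assumptions chosen for illustration, not taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class RecognizedObject:
    """Hypothetical per-object record in the maintained object data (field names are illustrative)."""
    object_id: int                           # stable identifier across color frames
    category: str                            # semantic label assigned by the 2-D recognizer
    points_3d: np.ndarray                    # fused 3-D object mask, shape (N, 3)
    bbox_3d: Tuple[np.ndarray, np.ndarray]   # (min_corner, max_corner) of the 3-D bounding box
    last_seen_timestamp: float               # timestamp of the most recent supporting color image


# The maintained object data could then simply be a list of such records.
ObjectData = List[RecognizedObject]
```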
Claim: 2. The method of claim 1, wherein the 3-D bounding box generation system comprises: a multi-view fusion system that generates an initial set of 3-D object masks.
Claim: 3. The method of claim 2, wherein the object recognition system, the synchronization system, and the multi-view fusion system operate in a stateless manner and independently from one another.
Claim: 4. The method of claim 2, wherein the multi-view fusion system comprises: an association system that identifies, from the maintained object data, matched object data specifying a corresponding object with the respective object mask of each recognized object in the color image; and a fusion system that generates, for each recognized object in the color image, an initial 3-D object mask by combining the object mask in the color image with the matched object data.
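Claims 4, 12, and 20 describe association of new 2-D masks with maintained objects followed by fusion. One way such association could be realized is as an assignment problem (the record cites Munkres 1957); the sketch below uses SciPy's Hungarian solver over a pairwise cost matrix and is an assumption about how matching might be done, not the patent's method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def associate_masks_to_objects(cost_matrix: np.ndarray, max_cost: float = 0.7):
    """Matches new 2-D masks (rows) to maintained objects (columns) by minimum total cost.

    cost_matrix[i, j] could be, e.g., 1 - IoU between mask i and the projection of object j.
    Pairs whose cost exceeds max_cost are treated as unmatched (candidate new objects).
    """
    rows, cols = linear_sum_assignment(cost_matrix)
    matches = [(i, j) for i, j in zip(rows, cols) if cost_matrix[i, j] <= max_cost]
    unmatched_masks = set(range(cost_matrix.shape[0])) - {i for i, _ in matches}
    return matches, sorted(unmatched_masks)


def fuse_points(existing_points: np.ndarray, new_points: np.ndarray) -> np.ndarray:
    """Naive fusion step: concatenate a matched object's points with newly back-projected points."""
    return np.concatenate([existing_points, new_points], axis=0)
```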
Claim: 5. The method of claim 2, wherein the 3-D bounding box generation system further comprises an object refinement system that refines the initial set of 3-D object masks to generate an initial set of 3-D bounding boxes.
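As a simple illustration of turning a 3-D object mask (a set of 3-D points) into an axis-aligned 3-D bounding box, the sketch below uses percentile clipping as an assumed outlier-rejection step; this is not the claimed refinement procedure.

```python
import numpy as np


def bbox_from_points(points_3d: np.ndarray, clip_percent: float = 1.0):
    """Axis-aligned 3-D box from an object's 3-D mask points, shape (N, 3).

    Clipping the lowest and highest percentiles per axis is one simple way to
    suppress depth outliers before taking the box extents (an assumption,
    not the patent's refinement method).
    """
    lo = np.percentile(points_3d, clip_percent, axis=0)
    hi = np.percentile(points_3d, 100.0 - clip_percent, axis=0)
    return lo, hi  # (min_corner, max_corner)
```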
Claim: 6. The method of claim 2, wherein the 3-D bounding box generation system further comprises a bounding box refinement system that refines the initial set of 3-D bounding boxes to generate the one or more 3-D bounding boxes.
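For the bounding-box refinement stage, one commonly used filtering step is non-maximum suppression (the record cites Neubeck et al. on efficient NMS). A minimal IoU-based 3-D NMS sketch follows as an assumption, not the claimed refinement system.

```python
import numpy as np


def iou_3d(box_a, box_b) -> float:
    """IoU of two axis-aligned 3-D boxes given as (min_corner, max_corner) arrays."""
    lo = np.maximum(box_a[0], box_b[0])
    hi = np.minimum(box_a[1], box_b[1])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(box_a[1] - box_a[0])
    vol_b = np.prod(box_b[1] - box_b[0])
    return inter / (vol_a + vol_b - inter + 1e-9)


def nms_3d(boxes, scores, iou_threshold=0.5):
    """Keeps the highest-scoring boxes, dropping overlapping lower-scoring ones."""
    order = np.argsort(scores)[::-1]
    keep = []
    for idx in order:
        if all(iou_3d(boxes[idx], boxes[k]) < iou_threshold for k in keep):
            keep.append(idx)
    return keep
```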
Claim: 7. The method of claim 1, wherein the object recognition system comprises a trained deep neural network (DNN) model that takes the color image as input and generates a respective two-dimensional (2-D) object mask for each of the one or more objects that have been recognized in the color image.
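Claims 7 and 15 cover a trained DNN that produces per-object 2-D masks; the record cites Mask R-CNN (He et al.). The sketch below shows how such masks could be obtained with the pretrained Mask R-CNN available in torchvision, as one possible recognizer rather than the one necessarily used in the patent.

```python
import torch
import torchvision

# One possible 2-D instance-mask recognizer: pretrained Mask R-CNN from torchvision.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()


def recognize_masks(color_image: torch.Tensor, score_threshold: float = 0.5):
    """color_image: float tensor of shape (3, H, W) with values in [0, 1].

    Returns a list of (label, soft mask of shape (1, H, W)) pairs above the score threshold.
    """
    with torch.no_grad():
        output = model([color_image])[0]
    keep = output["scores"] >= score_threshold
    return list(zip(output["labels"][keep].tolist(), output["masks"][keep]))
```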
Claim: 8. The method of claim 1, wherein determining, by the synchronization system, a corresponding depth image for the color image based on timestamps of the depth images and the timestamp of the color image comprises: identifying a candidate depth image which has a closest timestamp to the timestamp of the color image; determining that a time difference between the candidate depth image and the color image is less than a threshold; and in response, determining the candidate depth image as the corresponding depth image for the color image.
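Claims 8 and 16 spell out the synchronization rule: pick the depth image whose timestamp is closest to that of the color image and accept it only if the gap is below a threshold. A direct sketch follows; the threshold value and data layout are assumptions.

```python
def find_corresponding_depth(color_timestamp, depth_images, max_gap_seconds=0.05):
    """depth_images: iterable of (timestamp, depth_image) pairs.

    Returns the depth image with the closest timestamp to the color image,
    or None if even the closest one differs by more than max_gap_seconds.
    """
    candidate = min(depth_images, key=lambda item: abs(item[0] - color_timestamp), default=None)
    if candidate is None:
        return None
    timestamp, depth_image = candidate
    return depth_image if abs(timestamp - color_timestamp) <= max_gap_seconds else None
```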
Claim: 9. The method of claim 1, wherein the 3-D bounding box generation system determines, from the object masks and the corresponding depth image, a respective 3-D object mask for each of the one or more of the objects that have been recognized in the color image, and wherein the method further comprises: receiving, as output from the 3-D bounding box generation system, data specifying one or more 3-D object masks for the one or more of the objects recognized in the color image; and providing, as output, data specifying the one or more 3-D object masks.
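Claims 9 and 17 add 3-D object masks as an output. One standard way to obtain a 3-D mask from a 2-D mask and the corresponding depth image is to back-project masked pixels through a pinhole camera model; the intrinsics (fx, fy, cx, cy) below are assumed inputs, and this is an illustration rather than the patented computation.

```python
import numpy as np


def mask_to_3d_points(mask_2d: np.ndarray, depth: np.ndarray,
                      fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Back-projects masked pixels into camera-frame 3-D points.

    mask_2d: boolean array (H, W); depth: array (H, W) in meters (0 means no measurement).
    Returns an (N, 3) array of 3-D points, i.e. a 3-D object mask in the camera frame.
    """
    v, u = np.nonzero(mask_2d & (depth > 0))  # pixel rows (v) and columns (u)
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```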
Claim: 10. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations comprising: maintaining object data specifying objects that have been recognized in a scene in an environment; receiving a stream of input images of the scene, wherein the stream of input images comprises a stream of color images and a stream of depth images; for each of a plurality of color images in the stream of color images: providing the color image as input to an object recognition system; receiving, as output from the object recognition system, a recognition output that identifies a respective object mask in the color image for each of one or more objects that have been recognized in the color image; providing the color image and a plurality of depth images in the stream of depth images as input to a synchronization system that determines a corresponding depth image for the color image based on a timestamp of the corresponding depth image and a timestamp of the color image; providing the object data, the recognition output identifying the object masks, and the corresponding depth image as input to a three-dimensional (3-D) bounding box generation system that determines, from the object data, the object masks, and the corresponding depth image, a respective 3-D bounding box for each of one or more of the objects that have been recognized in the color image; and receiving, as output from the 3-D bounding box generation system, data specifying one or more 3-D bounding boxes for one or more of the objects recognized in the color image; and providing, as output, data specifying the one or more 3-D bounding boxes.
Claim: 11. The system of claim 10, wherein the 3-D bounding box generation system comprises a multi-view fusion system that generates an initial set of 3-D object masks, wherein the object recognition system, the synchronization system, and the multi-view fusion system operate in a stateless manner and independently from one another.
Claim: 12. The system of claim 11, wherein the multi-view fusion system comprises: an association system that identifies, from the maintained object data, matched object data specifying a corresponding object with the respective object mask of each recognized object in the color image; and a fusion system that generates, for each recognized object in the color image, an initial 3-D object mask by combining the object mask in the color image with the matched object data.
Claim: 13. The system of claim 11, wherein the 3-D bounding box generation system further comprises an object refinement system that refines the initial set of 3-D object masks to generate an initial set of 3-D bounding boxes.
Claim: 14. The system of claim 11, wherein the 3-D bounding box generation system further comprises a bounding box refinement system that refines the initial set of 3-D bounding boxes to generate the one or more 3-D bounding boxes.
Claim: 15. The system of claim 10, wherein the object recognition system comprises a trained deep neural network (DNN) model that takes the color image as input and generates a respective two-dimensional (2-D) object mask for each of the one or more objects that have been recognized in the color image.
Claim: 16. The system of claim 10, wherein determining, by the synchronization system, a corresponding depth image for the color image based on timestamps of the depth images and the timestamp of the color image comprises: identifying a candidate depth image which has a closest timestamp to the timestamp of the color image; determining that a time difference between the candidate depth image and the color image is less than a threshold; and in response, determining the candidate depth image as the corresponding depth image for the color image.
Claim: 17. The system of claim 10, wherein the 3-D bounding box generation system determines, from the object masks and the corresponding depth image, a respective 3-D object mask for each of the one or more of the objects that have been recognized in the color image, and wherein the operations further comprise: receiving, as output from the 3-D bounding box generation system, data specifying one or more 3-D object masks for the one or more of the objects recognized in the color image; and providing, as output, data specifying the one or more 3-D object masks.
Claim: 18. A computer program product encoded on one or more non-transitory computer readable media, the computer program product comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: maintaining object data specifying objects that have been recognized in a scene in an environment; receiving a stream of input images of the scene, wherein the stream of input images comprises a stream of color images and a stream of depth images; for each of a plurality of color images in the stream of color images: providing the color image as input to an object recognition system; receiving, as output from the object recognition system, a recognition output that identifies a respective object mask in the color image for each of one or more objects that have been recognized in the color image; providing the color image and a plurality of depth images in the stream of depth images as input to a synchronization system that determines a corresponding depth image for the color image based on a timestamp of the corresponding depth image and a timestamp of the color image; providing the object data, the recognition output identifying the object masks, and the corresponding depth image as input to a three-dimensional (3-D) bounding box generation system that determines, from the object data, the object masks, and the corresponding depth image, a respective 3-D bounding box for each of one or more of the objects that have been recognized in the color image; and receiving, as output from the 3-D bounding box generation system, data specifying one or more 3-D bounding boxes for one or more of the objects recognized in the color image; and providing, as output, data specifying the one or more 3-D bounding boxes.
Claim: 19. The non-transitory computer readable media of claim 18, wherein the 3-D bounding box generation system comprises a multi-view fusion system that generates an initial set of 3-D object masks, wherein the object recognition system, the synchronization system, and the multi-view fusion system operate in a stateless manner and independently from one another.
Claim: 20. The non-transitory computer readable media of claim 19, wherein the multi-view fusion system comprises: an association system that identifies, from the maintained object data, matched object data specifying a corresponding object with the respective object mask of each recognized object in the color image; and a fusion system that generates, for each recognized object in the color image, an initial 3-D object mask by combining the object mask in the color image with the matched object data.
Patent References Cited: 9158972 October 2015 Datta et al.
9251598 February 2016 Wells et al.
9542626 January 2017 Martinson et al.
2010/0201871 August 2010 Zhang et al.
2016/0196659 July 2016 Vrcelj
2017/0011281 January 2017 Dijkman et al.
2017/0228940 August 2017 Kutliroff
2017/0243352 August 2017 Kutliroff
Other References: Fischler et al., “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography,” Communications of the ACM, Jun. 1981, 24(6):381-395. cited by applicant
He et al., "Mask R-CNN," Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct. 22-29, 2017, Venice, Italy, 2980-2988. cited by applicant
Liu et al., “SSD: Single Shot MultiBox Detector,” arXiv, Dec. 29, 2016, arXiv:1512.02325v5, 17 pages. cited by applicant
Munkres, "Algorithms for the Assignment and Transportation Problems," J. Soc. Indust. Appl. Mathematics, Mar. 1957, 5(1):32-38. cited by applicant
Neubeck et al., “Efficient Non-Maximum Suppression,” Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06), Aug. 20-24, 2006, Hong Kong, 3:850-855. cited by applicant
PCT International Search Report and Written Opinion in International Appln. No. PCT/US2020/037573, dated Aug. 12, 2020, 10 pages. cited by applicant
Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 27-30, 2016, Las Vegas, Nevada, USA, 779-788. cited by applicant
Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS '15), Dec. 7-12, 2015, Montreal, Canada, 91-99. cited by applicant
Rubino et al., “3D Object Localisation from Multi-view Image Detections,” IEEE Transactions on Pattern Analysis and Machine Intelligence, May 4, 2017, 40(6):1281-1294. cited by applicant
Samet et al., “Efficient Component Labeling of Images of Arbitrary Dimension Represented by Linear Bintrees,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Jul. 1988, 10(4):579-586. cited by applicant
Primary Examiner: Wu, Yanna
Attorney, Agent or Firm: Fish & Richardson P.C.
Accession Number: edspgr.11257300
Database: USPTO Patent Grants