Refined 3D Object Localization with Monocular Camera using Depth Estimation and Geometric Refinement
Abstract
Accurate 3D object localization is a fundamental requirement for applications in industrial robotics, augmented reality, and autonomous navigation. While traditional multi-view systems are precise, their hardware complexity and cost limit widespread adoption. Monocular vision offers a cost-effective alternative but struggles with the inherent challenge of inferring depth from a single 2D image, often leading to significant localization errors. This paper presents a novel methodology that overcomes these limitations, achieving high-precision 3D localization using only a single camera. Our proposed framework integrates three synergistic stages. First, a camera calibration process using a checkerboard pattern corrects lens distortion and establishes a real-world metric coordinate system. Second, we employ the YOLOv8 model for real-time 2D object detection and the ZoeDepth network to generate a dense depth map from the monocular input. Third, to mitigate the spatial inaccuracies that arise when objects are positioned off-center, we introduce a geometric object-position refinement technique that adjusts the object's estimated center based on its projected image coordinates and depth information. Experimental results demonstrate the effectiveness of our approach, which achieves an average position error of just 4.6 mm, a 59.29% improvement in localization accuracy over the standalone YOLOv8 detector, confirming the method's suitability for robust 3D localization.
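The geometric refinement summarized above maps a detected 2D center and its estimated depth into camera-frame 3D coordinates. A minimal sketch of this step, using the standard pinhole back-projection model and a hypothetical calibrated intrinsic matrix K (the paper's actual refinement procedure may differ in detail):

```python
import numpy as np

def backproject_center(u, v, depth, K):
    """Back-project a detected 2D box center (u, v) with metric depth
    into camera-frame 3D coordinates via the pinhole model.

    K is the 3x3 intrinsic matrix obtained from checkerboard calibration.
    """
    fx, fy = K[0, 0], K[1, 1]          # focal lengths in pixels
    cx, cy = K[0, 2], K[1, 2]          # principal point
    x = (u - cx) * depth / fx          # lateral offset grows with distance
    y = (v - cy) * depth / fy          # from the principal point
    return np.array([x, y, depth])

# Hypothetical intrinsics and an off-center detection 1.5 m from the camera.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
p = backproject_center(480.0, 360.0, 1.5, K)
# p = [0.3, 0.225, 1.5]: the off-center pixel corresponds to a point
# displaced 0.3 m right and 0.225 m down in the camera frame.
```

This illustrates why uncorrected off-center detections accumulate error: the lateral displacement scales linearly with depth, so treating the image-plane center as the 3D position becomes increasingly wrong away from the optical axis.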