Improved Low-cost 3D Reconstruction Pipeline by Merging Data From Different Color and Depth Cameras

The performance of traditional 3D capture methods directly influences the quality of digitally reconstructed 3D models. In order to obtain complete and well-detailed low-cost three-dimensional models, this paper proposes a 3D reconstruction pipeline using point clouds from different sensors, combining captures of a low-cost depth sensor post-processed by Super-Resolution techniques with high-resolution RGB images from an external camera using Structure from Motion and Multi-View Stereo output data. The main contribution of this work is the description of a complete pipeline that improves the stage of information acquisition and merges data from different sensors. Several phases of the 3D reconstruction pipeline were also specialized to improve the model's visual quality. The experimental evaluation demonstrates that the developed method produces good and reliable results for low-cost 3D reconstruction of an object.


Introduction
3D reconstruction makes it possible to capture the geometry and appearance of an object or scene, allowing us to inspect details, measure properties, and reproduce 3D models in different materials. In recent years, numerous advances in 3D digitization have been observed, mainly by applying pipelines for three-dimensional reconstruction using costly high-precision 3D scanners. In addition, recent research has sought to reconstruct objects or scenes using depth images from low-cost acquisition devices (e.g., the Microsoft Kinect sensor (Newcombe et al., 2011)) or using Structure from Motion (SfM) (Schonberger and Frahm, 2016) combined with Multi-View Stereo (MVS) (Cernea, 2020) from RGB images.
Good-quality 3D reconstructions require substantial financial resources, as they depend on state-of-the-art equipment to capture object data with high precision and detail. Low-resolution equipment, on the other hand, is more financially viable but yields lower-quality captures. Despite their ease of operation, light weight, and portability, low-cost hardware approaches must consider the limitations of the scanning equipment used (Raimundo, 2018).
The acquisition step of a 3D reconstruction pipeline refers to the use of devices to capture data from objects in a scene, such as their geometry and color (Raimundo and Apaza-Agüero, 2020). One result of 3D geometry capture is a collection of discrete points that describes the shape of the model, called a point cloud. The data obtained in this step are used in all other phases of the 3D reconstruction process (Bernardini and Rushmeier, 2002).
Active capture methods use equipment such as scanners to infer an object's geometry through a beam of light, inside or outside the visible spectrum. The scanner sensor has the advantages of fast measuring speed, robustness to external factors, and ease of acquiring information. Active sensors also perform well when reconstructing texture-less and featureless surfaces (Chen et al., 2019, Raimundo and Apaza-Agüero, 2020). The sensors need to be sensitive to small variations in the acquired information: for small differences in distance, the difference in the time the light takes to reach two different points is very small, which requires low equipment latency and good response time. For this reason, these systems tend to be slightly noisy. In low-cost reconstruction approaches, the difficulty of capturing color with high precision is a disadvantage (Hernández and Vogiatzis, 2010).
Passive methods are based on optical imaging techniques. They are highly flexible and work well with any modern digital camera. Image-based 3D reconstruction is practical, non-intrusive, low-cost and easily deployable outdoors. Various properties of the images can be used to retrieve the target shape, such as material, viewpoints and illumination. As opposed to active techniques, image-based techniques provide an efficient and easy way to acquire the color of a target object (Hernández and Vogiatzis, 2010). Although passive reconstructions, mainly those using SfM and MVS, produce excellent results, they have limitations such as the difficulty of distinguishing the target object from the background (Sergeeva and Sablina, 2018), and they require the target object to have detailed geometry (Chen et al., 2019). A controlled environment is needed to obtain better reconstruction results (Hosseininaveh Ahmadabadian et al., 2019, Schonberger and Frahm, 2016).
Considering the limitations of the presented approaches, it is important to note that a single low-cost capture method struggles to describe the complete geometry of a target along with its rich, small details (Chen et al., 2019). This paper proposes a hybrid pipeline that combines a low-cost depth camera (low-resolution images) with an external color camera (a digital camera producing high-resolution RGB images) to estimate and reconstruct the surface of an object and apply a high-quality texture. The proposed pipeline overcomes the individual limitations of each low-cost capture approach, generating a complete and well-detailed replica of the target model with high visual quality. To achieve this, the project uses a variation and combination of Structure from Motion, Multi-View Stereo and depth camera capture techniques.
The main contribution of this work is the description of a complete, low-cost pipeline that makes use of post-processed depth captures and merges data from different sensors, without requiring the depth sensor data and the high-resolution color images to be synchronized.
In addition to this introductory section, this work is organized as follows: Section 2 presents related works, while Section 3 describes the proposed pipeline. The experiments and evaluation of the pipeline are presented in Section 4. Finally, Section 5 discusses the final considerations and results achieved by this research.

Related work
Prokos et al. (2009) proposed a hybrid approach combining shape from stereo (with additional geometric constraints) and laser scanning techniques. Using two cameras and a portable laser beam, they achieved accuracy as good as some high-end laser triangulation scanners. Their method does not include automatic outlier detection.
The KinectFusion system (Newcombe et al., 2011) tracks the pose of portable depth cameras (Kinect) as they move through space and performs good three-dimensional surface reconstructions in real-time. The Kinect sensor has considerable limitations, including temporal inconsistency and the low resolution of the captured color and depth images (Raimundo and Apaza-Agüero, 2020). This approach does not include the texturing step.
Silva et al. (2013) provide a guided reconstruction process using Super-Resolution (SR) techniques, helping to increase the quality of the low-resolution data captured with a low-cost depth sensor. The method of data acquisition using low-cost depth cameras and SR is also improved by Raimundo and Apaza-Agüero (2020). Even with depth image improvements, a poor registration of captures can affect the final model's shape.
Falkingham (2013) demonstrates the potential applications of low-cost technology in the field of paleontology. The Microsoft Kinect was used to digitize specimens of various sizes, and the resulting digital models were compared with models produced using SfM and MVS. The work pointed out that, although the Kinect generally registers morphology at a lower resolution and captures less detail than photogrammetry techniques, it offers advantages in the speed of data acquisition and in generating the 3D mesh in real-time during data capture. That work did not use Super-Resolution to improve captures from low-cost devices, and the models produced by the Kinect lack any color information.
Zollhöfer et al. (2014) used a Kinect sensor to capture the geometry of an excavation site and took advantage of a topographic map to distort the reconstructed model, significantly increasing the quality of the scene. The global distortion, with Super-Resolution techniques applied to raw scans, significantly increased the fidelity and realism of the results but is too specialized for large-scale scenes.
Di Paola and Inzerillo (2018), in order to digitally reproduce the Egyptian stone from Palermo, proposed a method that uses a structured light scanner, smartphones and SfM to apply texture to the highly accurate mesh generated by the scanner. The main challenges were the dark color of the material and the shallowness of the hieroglyph grooves, which some capture approaches have difficulty recognizing. The texture was applied with a quite accurate level of detail. This reference work used a high-resolution 3D scanner and did not aim for a low-cost reconstruction.
Jo and Hong (2019) use a combination of terrestrial laser scanning and Unmanned Aerial Vehicle (UAV) photogrammetry to establish a three-dimensional model of the Magoksa Temple in Korea. The scans were used to acquire the perpendicular geometry of buildings and locations, and were aligned and merged with the photogrammetry output, producing a hybrid point cloud. The photogrammetry adds value to the 3D model, complementing the point cloud with the upper parts of buildings, which are difficult to acquire through laser scanning. Chen et al. (2019) propose a registration method that combines the data of a laser scanner and photogrammetry to reconstruct a real outdoor 3D scene. They managed to greatly increase the accuracy and convenience of the operation. The two sensors can work independently, as the method fuses their data even when they are at different scales. That work did not explore mesh reconstruction and texturing, nor did it use MVS point clouds in its experiments.
Unlike the related works, the pipeline described in this article includes all reconstruction steps from capture to texturing, focusing on data merging using low-cost equipment.

Pipeline proposal
To overcome the limitations of the low-cost three-dimensional data acquisition process, such as the low resolution of depth captures from a low-cost sensor and the need for features in photogrammetric reconstruction, and to take advantage of each method individually, the following pipeline is proposed (source code available at https://github.com/Eberty/LowCost3DReconstruction): acquisition of depth and color images (using a low-cost depth sensor and a digital camera); generation of point clouds from low-cost RGB-D camera depth images (using SR techniques (Raimundo and Apaza-Agüero, 2020)); shape estimation from RGB images (using SfM (Schonberger and Frahm, 2016) and MVS (Cernea, 2020)); alignment and merging of data from these different capture techniques; surface reconstruction; and texturing with high-quality photos (Fig. 1).
Several phases of the pipeline were specialized to achieve better accuracy and visual quality in 3D reconstructions of small and medium scale objects. The proposed pipeline works offline, allowing greater freedom in the execution of individual steps.

Data acquisition
The data acquisition step comprises the capture of depth and color images (raw data), generation of point clouds from low-cost camera depth images and the shape estimation from RGB images (processed data). As the output, this step provides the point clouds used in the next steps of the pipeline.

Low-cost depth captures
For the captures using a low-cost depth sensor, we established the following acquisition procedure: take several depth captures, moving the sensor around the object, and defining the limits of the capture volume. The number of views captured is less than that of real-time approaches due to the additional processing required to ensure the quality of each capture (Raimundo and Apaza-Agüero, 2020). Considering the quality requirements for this proposed work, an interactive tool (Raimundo, 2018) is used to acquire the raw data from the depth sensor (Fig. 2). The depth capture method will present results proportional to the best captures of the device (less noise incidence and best depth accuracy). To achieve this, each depth image, acquired by a low-cost depth sensor, goes through a filtering step with the application of Super-Resolution (Raimundo and Apaza-Agüero, 2020). In order to provide high-resolution information beyond what is possible with a specific sensor, several low-resolution captures are merged, recreating as much detail as possible.
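The idea of merging several low-resolution captures can be illustrated with a deliberately simplified sketch. It is not the Super-Resolution method of Raimundo and Apaza-Agüero (2020): it assumes that the depth frames come from a static sensor and are already aligned pixel by pixel, that a value of 0 marks an invalid reading, and it performs no upsampling. The function name `fuse_depth_frames` is hypothetical.

```python
import numpy as np

def fuse_depth_frames(frames, invalid=0):
    """Per-pixel median of aligned depth frames, ignoring invalid readings.

    Simplified stand-in for a Super-Resolution filter: assumes a static
    sensor, pixel-aligned frames, and `invalid` as the "no data" value.
    """
    stack = np.asarray(frames, dtype=np.float64)          # shape (N, H, W)
    masked = np.where(stack == invalid, np.nan, stack)    # hide invalid pixels
    valid_count = np.sum(~np.isnan(masked), axis=0)       # readings per pixel
    fused = np.full(stack.shape[1:], float(invalid))
    has_data = valid_count > 0
    # Median over the frame axis, only where at least one reading exists.
    fused[has_data] = np.nanmedian(masked[:, has_data], axis=0)
    return fused
```

With 16 low-resolution frames, as used in the experiments, each fused pixel would be the median of up to 16 readings, which suppresses isolated noisy values while keeping pixels that were valid in only a few frames.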

Photogrammetry
In order to add 3D information in greater detail and apply a simple high-quality texturing process, photographs are taken with a digital camera around the target object. In our pipeline, these captures are independent of the depth sensor. It is only necessary to keep the object fixed while the camera moves freely around it. The set of captured images must be sufficient to cover most of the object's surface, and each pair of neighboring images must share common parts of it. The color images are used as input to the SfM pipeline.
The SfM pipeline detects characteristics in the images (feature detection), finds descriptors capable of representing each distinguishable region, and maps these characteristics between images (feature matching). The matched descriptors represent vertices of the reconstruction of the 3D scene (sparse reconstruction). The greater the number of matches found between the images, the more accurately a 3D transformation matrix between the images can be calculated, providing an estimate of the relative position between camera poses (Hernández and Vogiatzis, 2010, Bianco et al., 2018).
Photographs with good resolution and objects with a high level of detail tend to bring greater precision to the photogrammetry algorithms. For objects with fewer details and features, the environment can be used to achieve better results (Schonberger and Frahm, 2016, Chen et al., 2019). In addition to using the estimated structure to improve the geometry captured by the depth sensor, we use the estimated camera poses to apply texture directly over the final model surface.
The Multi-View Stereo process is used to improve the point cloud obtained by SfM, resulting in a dense reconstruction. As camera parameters such as position, rotation, and focal length are known from SfM, MVS computes 3D vertices in regions not detected by the descriptors. Multi-View Stereo algorithms generally have good accuracy, even with few images (Hernández and Vogiatzis, 2010). A good evaluation of the performance of different state-of-the-art SfM and MVS implementations is presented by Bianco et al. (2018). For the resulting image-based point cloud, a crop box filter and Euclidean cluster extraction can be used to isolate the target object. If the floor below the object is discernible, it is also possible to use a planar segmentation algorithm to remove the plane. A statistical removal algorithm can also be used to remove outliers. Most of the discrepancies and the background are removed using the proposed steps, minimizing working time and human intervention. Details of the implementation and application of the algorithms described in this paragraph are presented by Rusu and Cousins (2011).
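Two of the cleaning steps mentioned above, the crop box and the statistical outlier removal, can be sketched in a few lines. This is an illustrative NumPy version under simplifying assumptions, not the PCL implementation used in the pipeline: neighbors are found by brute force (quadratic memory, suitable only for small clouds), and the parameter names `k` and `std_ratio` are chosen for this example.

```python
import numpy as np

def crop_box(points, lower, upper):
    """Keep only the points inside the axis-aligned box [lower, upper]."""
    points = np.asarray(points, dtype=np.float64)
    inside = np.all((points >= np.asarray(lower)) & (points <= np.asarray(upper)), axis=1)
    return points[inside]

def statistical_outlier_removal(points, k=8, std_ratio=1.0):
    """Drop points whose mean distance to their k nearest neighbors is
    larger than (global mean + std_ratio * global standard deviation)."""
    points = np.asarray(points, dtype=np.float64)
    # Brute-force pairwise distances, shape (n, n).
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # After sorting each row, column 0 is the distance of a point to itself.
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    mean_knn = knn.mean(axis=1)
    threshold = mean_knn.mean() + std_ratio * mean_knn.std()
    return points[mean_knn <= threshold]
```

A typical use would crop the dense cloud to a box around the object first and then remove the remaining isolated points; both thresholds depend on the scale of the photogrammetry model and would need to be tuned per capture.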
Although image-based 3D reconstructions achieve greater detail than low-cost depth sensors (Falkingham, 2013), this approach may not be able to recover the complete object (Fig. 3). This is a common result when the captures do not fully describe the target model, or when the object lacks a sufficiently distinguishable texture or detail (Chen et al., 2019).

Normal estimation
The algorithms used in the next steps require oriented data, that is, points with associated normals. The normals of the point clouds are therefore estimated before the alignment step, using a k-nearest-neighbor normal estimation algorithm (Rusu and Cousins, 2011).
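A common way to estimate normals from k neighbors is to fit a local plane by principal component analysis and take the direction of least variance. The sketch below follows that idea; it is an illustration rather than the PCL routine itself, it uses a brute-force neighbor search, and the `viewpoint` parameter used to orient the normals consistently is an assumption of this example.

```python
import numpy as np

def estimate_normals(points, k=10, viewpoint=(0.0, 0.0, 0.0)):
    """Estimate one unit normal per point from its k nearest neighbors.

    The normal is the eigenvector of the neighborhood covariance with the
    smallest eigenvalue, flipped so that it points toward `viewpoint`.
    """
    points = np.asarray(points, dtype=np.float64)
    viewpoint = np.asarray(viewpoint, dtype=np.float64)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbor_idx = np.argsort(dist, axis=1)[:, :k]   # includes the point itself
    normals = np.empty_like(points)
    for i, idx in enumerate(neighbor_idx):
        neighborhood = points[idx]
        centered = neighborhood - neighborhood.mean(axis=0)
        cov = centered.T @ centered
        _, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
        normal = eigvecs[:, 0]                       # direction of least variance
        if np.dot(normal, viewpoint - points[i]) < 0:
            normal = -normal                         # orient toward the viewpoint
        normals[i] = normal
    return normals
```

For a depth capture, the sensor position is a natural choice of viewpoint; for a merged cloud, a consistent orientation would require a different strategy.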

Alignment
The alignment task, usually called Point Cloud Registration, seeks to find the transformations which align two (or more) point sets, placing all captures in a global coordinate system. To find these transformations, the algorithm needs to establish correct correspondences between features present in each point cloud.
The registration is usually performed in a coarse and a fine alignment step. To perform the coarse alignment, we use global alignment algorithms in which pairs of three-dimensional captures are roughly aligned (pairwise incremental registration). A good algorithm for global registration is Super 4PCS (Mellado et al., 2014). With the captures positioned closer to the correct spot in the real-world representation, a fine adjustment step aims to align the geometric features of the objects. To do this, the Iterative Closest Point (ICP) algorithm (Holz et al., 2015) is used due to its satisfactory performance on the registration problem. Because of the nature of the depth data used, this step needs to be carefully parameterized to produce good alignment results; otherwise it may lead to drift in the registration (Wang et al., 2016, Raimundo, 2018). SfM approaches use a geometric verification strategy to improve the triangulation method, which is responsible for finding the relationships between multiple planes. This strategy yields a more robust camera position estimation, improving the 3D reconstruction and the projection of images (Schonberger and Frahm, 2016, Bianco et al., 2018). With this and the results from MVS (Cernea, 2020), we use the point cloud obtained by photogrammetry as an auxiliary reference to apply a new alignment to the depth sensor point clouds. This adjusts the initial transformations, spreads the error accumulated between consecutive alignments, and avoids loop closure problems (Li et al., 2013).
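The fine alignment step can be illustrated with a bare-bones point-to-point ICP. This is a minimal sketch and not the PCL implementation referenced above: it uses brute-force nearest neighbors, has no outlier rejection or distance threshold, and assumes the clouds are already coarsely aligned. The function names are hypothetical.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t with R @ src_i + t ~ dst_i,
    for points paired row by row (SVD-based closed form)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(source, target, iterations=30, tol=1e-9):
    """Align `source` to `target`; returns the accumulated R and t."""
    src = np.asarray(source, dtype=np.float64).copy()
    target = np.asarray(target, dtype=np.float64)
    R_total, t_total = np.eye(3), np.zeros(3)
    prev_err = np.inf
    for _ in range(iterations):
        dist = np.linalg.norm(src[:, None, :] - target[None, :, :], axis=2)
        nearest = dist.argmin(axis=1)                # closest target point per source point
        R, t = best_rigid_transform(src, target[nearest])
        src = src @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
        err = dist.min(axis=1).mean()                # mean distance before this update
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return R_total, t_total
```

Because each iteration trusts the nearest-neighbor pairing, a poor initial pose can lead to a wrong local minimum, which is why a coarse global alignment precedes this step.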
It is important to note that the point cloud generated by the image-based 3D reconstruction pipeline and those obtained from the depth sensor captures are created from different image spectra and very commonly have different scales (Chen et al., 2019). As the depth sensor captures are already in a common global coordinate system, aligning them with the corresponding points of the object in the photogrammetry point cloud only requires finding a transformation matrix that places a single, already aligned depth sensor capture over the MVS point cloud, either manually or with a scale-based iterative closest point algorithm (scale-based PCA-ICP) (Chen et al., 2019). After finding this matrix, we apply the transformation to all depth sensor point clouds. For better results, the ICP algorithm (Holz et al., 2015) can then be applied to each depth sensor point cloud against the photogrammetry output point cloud. The camera positions used for texturing remain in the photogrammetry model's coordinate system.
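When the transformation is found manually, one possible approach is to pick a few corresponding points in one depth capture and in the MVS cloud and solve for scale, rotation and translation in closed form. The sketch below uses Umeyama's least-squares solution for that purpose. It is an assumption of this example, not the scale-based PCA-ICP of Chen et al. (2019), and it requires known point correspondences; the function names are hypothetical.

```python
import numpy as np

def similarity_transform(src, dst):
    """Least-squares scale s, rotation R, translation t with
    s * R @ src_i + t ~ dst_i, for points paired row by row."""
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    n = len(src)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / n                        # cross-covariance (3 x 3)
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0                               # avoid a reflection
    R = U @ D @ Vt
    var_src = (src_c ** 2).sum() / n                 # variance of the source points
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t

def apply_similarity(cloud, s, R, t):
    """Apply the same scaled rigid transform to a whole point cloud."""
    return s * np.asarray(cloud, dtype=np.float64) @ R.T + t
```

Once `s`, `R` and `t` are estimated from one capture, `apply_similarity` can be called on every depth sensor cloud, since they already share a coordinate system.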
Merging the point clouds from both data capture approaches, by accumulating all 3D coordinates described by each point cloud and saving them as a single cloud, should increase the information that defines the object geometry.
The merged point clouds are also filtered using a statistical outlier removal algorithm (Rusu and Cousins, 2011) and down-sampled to facilitate visualization, mesh generation, and processing, since the aligned and combined point clouds may have an excessive and redundant number of vertices and there is no guarantee that the sampling density is sufficient for proper reconstruction (Bernardini and Rushmeier, 2002). A voxel grid filter (Rusu and Cousins, 2011) is used to downsample the point cloud, joining points that are close enough to each other. The resulting point cloud is used in the next steps of the pipeline.
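Merging and voxel grid downsampling can be sketched as follows. This is an illustrative NumPy version, assuming that each cloud is an array of XYZ coordinates already expressed in the same coordinate system; points falling in the same cubic voxel are replaced by their centroid, and `voxel_size` is a free parameter of this example.

```python
import numpy as np

def merge_clouds(clouds):
    """Accumulate the coordinates of all aligned clouds into a single cloud."""
    return np.vstack([np.asarray(c, dtype=np.float64) for c in clouds])

def voxel_grid_downsample(points, voxel_size):
    """Replace all points that fall in the same voxel by their centroid."""
    points = np.asarray(points, dtype=np.float64)
    voxel_idx = np.floor(points / voxel_size).astype(np.int64)
    _, inverse, counts = np.unique(
        voxel_idx, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)                    # 1-D voxel id per point
    sums = np.zeros((len(counts), 3))
    np.add.at(sums, inverse, points)                 # sum the points of each voxel
    return sums / counts[:, None]
```

A smaller `voxel_size` keeps more of the fine detail contributed by photogrammetry, at the cost of a larger cloud for the surface reconstruction step.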

Surface reconstruction
The mesh generation step consists of reconstructing the surface, a process in which a continuous 3D surface is inferred from a collection of discrete points that sample the object's shape (Berger et al., 2017).
For this step, we use the Screened Poisson Surface Reconstruction algorithm (Kazhdan and Hoppe, 2013) as available in MeshLab. This algorithm seeks a surface whose gradient best matches the normals of the vertices of the input point cloud. The choice of this parametric method for the surface reconstruction is justified by its robustness in geometric fidelity and by the possibility of using numerical methods to improve the results. In addition, the resulting meshes are almost regular and smooth.

Texture synthesis
Applying textures to reconstructed 3D models is one of the keys to realism (Waechter et al., 2014). High-quality texture mapping aims to avoid seams, smoothing the transition of an image used for applying texture and its adjacent one (Muratov et al., 2016).
The texture synthesis phase of the proposed pipeline comprises the combination of the high-resolution pictures captured with an external digital camera with the integrated model obtained from the surface reconstruction.
The high-resolution photos taken with a digital camera, together with the poses calculated using SfM, are used to generate the texture coordinates and the texture atlas of the model, avoiding a time-consuming manual process. For this texturing stage, we used the algorithm proposed by Waechter et al. (2014).
The images with poses from SfM cannot texture faces that are not visible in any image used for the reconstruction, which leaves non-textured surfaces in the three-dimensional model. To overcome this limitation, we apply texture in a post-processing step, combining the relative camera poses from SfM with new photos whose poses are computed in the coordinate system of the photogrammetry result.
The need for additional photos is determined by non-textured surfaces in the final texturing result. Each new photo is registered manually using an interactive program such as MeshLab (Cignoni et al., 2008), followed by a mutual information filter (Corsini et al., 2009) for fine adjustment, which yields a transformation matrix for the new photo. Note that, as this is a post-processing step, the new image of the object can be used even if the photo was taken in another environment.
Using a suitable output format, the new camera's pose is added to the output of the SfM module, and the texturing algorithm must be run again.

Experiments and evaluation
For evaluation, we ran the proposed pipeline on tabletop objects of varying size and complexity. We present the results for a porcelain horse-shaped object ("Porcelain horse", Fig. 4) and for replicas of a jaguar-shaped and a turtle-shaped clay pan ("Jaguar pan", Fig. 5 and "Turtle pan", Fig. 6, respectively). These last two objects are replicas of cultural objects from the collection of the Federal University of Bahia's Brazilian Museum of Archaeology and Ethnology (MAE/UFBA) and were three-dimensionally reconstructed and 3D printed. In addition, the turtle replica was colored by hydrographic printing.
Figure 4: Porcelain horse. Given the richness of detail of this object, as in the head and saddle, we use the photogrammetry method to capture these regions with the highest level of detail. Because the object also has few features in predominantly smooth regions, such as the base of the structure and the body of the animal, we use the depth sensor capture approach there, where this factor does not influence the 3D acquisition process.
In our experiments we used the Microsoft Kinect version 1; however, any other low-cost sensor can be used to capture depth images. This sensor is affordable and captures color and depth information with a resolution of 640x480 pixels. To produce point clouds from the low-cost 3D scanner, we used the Super-Resolution approach proposed by Raimundo and Apaza-Agüero (2020) with 16 Low-Resolution (LR) depth frames.
The photos used as input to the passive 3D reconstruction method were taken with a Redmi Note 8 camera for all evaluated models. The number of photos was arbitrarily chosen to maximize the coverage of the object. For the SfM pipeline, the RGB images were processed using COLMAP (Schonberger and Frahm, 2016) to calculate camera poses and sparse shape reconstruction. OpenMVS (Cernea, 2020) was used for dense reconstruction.
Some software tools were developed on top of third-party libraries for various purposes. For instance, OpenCV (Bradski, 2000) and PCL (Rusu and Cousins, 2011) were used to handle and process depth images and point clouds, and libfreenect (OpenKinect, 2012) was used in the depth acquisition application to access and retrieve data from the Microsoft Kinect. MeshLab (Cignoni et al., 2008) was used for Poisson reconstruction and for adjustments to 3D point clouds and meshes when necessary.
Details for reproducing the results can be found in the project repository.
Figs. 4 and 5 show the acquisition, merging, and reconstruction steps proposed by this pipeline for the Porcelain horse and the Jaguar pan, respectively. The figure captions also discuss the main challenges of each reconstruction and how the pipeline handled them. The algorithms and main components of each experiment are described in Table 1.
The resolution of the clouds obtained by the low-cost sensor, even with SR, is considerably lower than that of the clouds obtained by photogrammetry; therefore, although these point clouds represent the geometry well, they do not describe small object details with good precision. The low-cost sensor captures also presented a scale limitation, making it difficult to retrieve the geometry of small objects such as the Turtle replica (Fig. 6). However, the depth sensor has the advantage of allowing new captures of the object even if it has been moved in the scene. Photogrammetry presented limitations when describing featureless regions of any object (as shown in Fig. 3 and Fig. 5e). This does not happen with the depth sensor, since coloring does not influence its captures. The point clouds obtained by photogrammetry were capable of representing, with good quality, distinguishable details on a millimeter scale. Merging the point clouds helped to express the reconstructed objects in greater detail, taking advantage of both kinds of capture.
Figure 5: Jaguar pan replica. Even with some visual characteristics generated by the 3D printing process, the object has very few distinguishable features. This factor makes the reconstruction process by SfM and MVS difficult. Because of this, we use the environment to assist in detecting the positions and orientations of the cameras. The data captured by the low-cost depth sensor added information where the photogrammetry method produced no relevant representation, as can be seen at the legs of the jaguar.
Point clouds were meshed using the Screened Poisson Surface Reconstruction filter in MeshLab (Cignoni et al., 2008), with a reconstruction depth of 7 and a minimum number of samples of 3. It is important to note that mesh production depends heavily on the parameters used to generate the surface. The parameters defined in this paragraph were used as the standard for all Poisson surface reconstructions.
For quantitative validation, the 3D surface reconstructions of the Jaguar were compared with a ground truth: the 3D mesh that was printed and used as the target of our reconstructions (Fig. 7f). To allow comparison with previous work, this paper also compares against results produced with publicly available implementations based on the KinectFusion system: PCL Kinfu (Rusu and Cousins, 2011) and ReconstructMe (Heindl et al., 2015). For the comparison, we used the one-sided Hausdorff Distance tool of MeshLab (Cignoni et al., 2008). The results are graphically represented in Fig. 7 and discussed in Table 2.
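The one-sided Hausdorff distance measures, for every sample of one model, the distance to the closest point of the other model, without the reverse check. The sketch below illustrates the measure on point sets only. It is a simplification of the MeshLab tool, which samples the mesh surfaces; here both inputs are assumed to be arrays of XYZ points, and the returned statistics are a choice of this example.

```python
import numpy as np

def one_sided_hausdorff(samples, reference):
    """Distance from each sample to its nearest reference point,
    summarized as maximum, mean and root mean square."""
    samples = np.asarray(samples, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    dist = np.linalg.norm(samples[:, None, :] - reference[None, :, :], axis=2)
    nearest = dist.min(axis=1)                       # one distance per sample
    return {
        "max": float(nearest.max()),
        "mean": float(nearest.mean()),
        "rms": float(np.sqrt((nearest ** 2).mean())),
    }
```

The measure is not symmetric: swapping the arguments answers a different question, namely how far the reference lies from the reconstruction, which penalizes missing regions instead of spurious ones.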
It was observed that the capture approach with super-resolution manages to get very close to the real geometry of the object, although it loses details that can be recovered by the MVS approach. The KinectFusion-based systems, lacking an adequate treatment of the data such as SR, represent the geometry well but are also unable to reproduce small details because of the temporal inconsistency of the Kinect.
Figure caption (beginning lost in the source): ... shows that the low-cost depth sensor was unable to identify details of the model; this is due to the small size of the object, which makes details difficult to obtain. However, this mesh was able to represent the model in all directions, including the bottom. The merged mesh (c) was able to reproduce all the small details found by photogrammetry and include regions that were represented only by depth sensor captures. For comparison, (d) presents the model's ground truth used for 3D printing.
All evaluated objects benefited from the merging of point clouds. For the jaguar pan, the captures with the depth sensor added information in the legs and the belly (bottom) of the jaguar that was not acquired by photogrammetry. Poisson surface reconstruction identifies and differentiates nearby geometric details, some of which are added by the merging. For the horse, small depressions in the mouth and eye of the original model were not well recovered in the reconstructed model. Nonetheless, small reliefs of the saddle and mane were well preserved.
Texturing results using surfaces from merged point clouds are shown in Figs. 4i, 5i and 8. The results of this stage are satisfactory because of the high quality of the images used and because the camera positions obtained from SfM are correctly aligned with the target object and undistorted.

Conclusion
With the proposed pipeline, it is possible to add 3D capture information, reconstructing details beyond what a single low-cost capture method initially provides. A low-cost depth sensor allows preliminary verification of data during acquisition. The Super-Resolution methodology reduces the incidence of noise and mitigates the low amount of detail in depth maps acquired using low-cost RGB-D hardware. Photogrammetry, despite capturing a higher level of detail, has certain limitations related to the amount of geometric and feature detail that the object offers. The texturing process, which uses high-definition images from the SfM output and adds possible missing parts if needed, also helps to achieve greater visual realism in the reconstructed 3D model.
The proposed techniques are not novel, but adapted from known methods. Their combination to compose a low-cost 3D reconstruction is a novelty, and it demonstrates a good surface representation of tabletop objects, even for small details.
The pipeline, despite being robust, has some limitations. If part of the object still cannot be recognized after merging the point clouds from the different capture methods, the reconstructed surface may be too smooth or distorted. If the registration between captures does not reach a desirable alignment, the quality of the surface reconstruction may degrade. The photogrammetry model is usually denser than the Kinect one, which can make the alignment and surface reconstruction processes more difficult. Texturing regions not covered by photogrammetry requires manual intervention.
Future research involves a quantitative analysis of the 3D reconstruction after the texturing step. We also plan a better evaluation of the automated alignment of point clouds using the scale-based iterative closest point algorithm (scaled PCA-ICP) and the application of this pipeline to the digital preservation of artifacts from the cultural heritage collection of the MAE/UFBA.