Improved point clouds from a heritage artifact depth low-cost acquisition

Data acquired using low-cost depth cameras exhibits undesirable traits such as low resolution and a high amount of noise, yielding point clouds with insufficient information about an object. This limits the use of such devices for 3D reconstruction of heritage artifacts. This work aims to improve low-cost depth acquisition by using a new approach based on Super-Resolution techniques. The proposed approach has been applied to several artifacts of the Federal University of Bahia Museum of Archaeology and Ethnology (MAE/UFBA). The results show that our approach improves the quality of point clouds generated from several heritage artifacts. Our analysis concludes that whenever additional geometry is obtained via the proposed method there is actual reconstruction of detail, while any geometry that is removed is usually related to the removal of inconsistencies or noise from the input data, without loss of detail.


Introduction
Projects such as the Digital Michelangelo (Levoy et al., 2000) and the Great Buddha (Miyazaki et al., 2000) are examples of successful 3D reconstruction applied to digital heritage.
However, as was the case with these two projects, digital heritage missions frequently employ highly specialized and costly equipment and personnel. Nevertheless, some recent reconstruction pipelines focus on using low-cost acquisition technologies (Dias et al., 2006). Potential advantages of these low-cost approaches include the possibility of real-time reconstruction; GPU acceleration; and cheap, portable and lightweight equipment. The difference in price between low-cost and traditional 3D acquisition is roughly two to three orders of magnitude. An Intel Realsense low-cost camera costs in the range of US$100 and a Microsoft Kinect V1 (now discontinued) costs around US$50, whereas traditional laser scanners can cost from US$10,000 to more than US$100,000.
The Kinect V1 is the first version of a cheap motion-tracking device based on an infrared emitter/receiver pair. The Kinect works by projecting a pattern of structured infrared light (Pavlidis et al., 2007) upon the scene, much like older low-cost reconstruction pipelines (Rocchini et al., 2001). The infrared receiver detects the distortions caused by the reconstructed scene on the pattern; based on these distortions and the known parameters of the acquisition hardware (distance between the sensors, etc.), it is possible to obtain the distance of the detected points to the infrared emitter and obtain depth images of the scene at a rate of 30 frames per second. Moreover, the Kinect V1 has an RGB camera which is used to capture color information at the same rate as the depth images.
The main difference between using a low-cost sensor and high-end acquisition hardware lies in the quality of the acquired data. The data provided by the Kinect scanner is noisy, low-resolution (both the depth maps and color images are captured at 640x480 resolution) and inconsistent (pieces of information appear and disappear even in successive frames). Cheaper acquisition hardware also indirectly limits the scale of the reconstruction targets, which is constrained by factors such as scanning resolution and operational range.
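The triangulation principle behind this depth computation can be illustrated with a simplified stereo model, in which depth is inversely proportional to the observed shift (disparity) of the projected pattern. The sketch below is only an illustration under that assumption, not the computation performed inside the Kinect; the function name and the focal length, baseline and disparity values are hypothetical.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth (metres) of a point under a simplified stereo triangulation model.

    focal_px: focal length of the infrared receiver, in pixels (assumed known).
    baseline_m: distance between the emitter and the receiver, in metres.
    disparity_px: horizontal shift of the projected pattern, in pixels.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive to triangulate a depth")
    return focal_px * baseline_m / disparity_px


# Illustrative values only (not calibrated Kinect parameters).
near = depth_from_disparity(focal_px=580.0, baseline_m=0.075, disparity_px=43.5)
far = depth_from_disparity(focal_px=580.0, baseline_m=0.075, disparity_px=21.75)
print(near, far)  # a smaller disparity corresponds to a larger depth
```

Because depth varies with the inverse of the disparity, a fixed disparity error translates into a larger depth error for distant points, which is one reason the operational range of such sensors is limited.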
In this context, the central contribution of the present research is a new approach, based on Super-Resolution (SR) techniques (Nasrollahi and Moeslund, 2014), to enhance the 3D data obtained from a low-cost device. Our approach generates a high-resolution (HR) depth image from low-resolution (LR) depth images of the object. The associated 3D reconstruction pipeline processes this HR data and generates a point cloud with more detail and less noise compared to clouds generated from the original LR observations. The approach has been applied in a practical context of digital heritage, capturing several cultural objects of the UFBA Museum of Archaeology and Ethnology.
An important observation is that even though a Microsoft Kinect V1 was used during the acquisition phase of this work, the method is not tied to a specific depth sensor. Therefore, even better results could be possible with newer and more accurate low-cost acquisition devices such as the Microsoft Kinect V2 and the Intel Realsense depth cameras. Moreover, it should be noted that while several studies work with 3D meshes and point clouds directly, our method treats the acquired data as a grayscale 16-bit 2D image during the acquisition stage. This allows the use of 2D image processing techniques instead of 3D geometry processing algorithms for tasks such as noise removal and registration, which ultimately yielded good results. Extra care was taken to never quantize the input data using less than 16 bits, thus preserving its original precision.
The rest of this work is structured as follows: Section 2 reviews previous work focused on improving data quality from low-cost acquisitions; Section 3 details the proposed approach based on Super-Resolution; Section 4 presents experimental results, showing some cultural heritage artifacts used in this work; finally, Section 5 presents the conclusions of this work and some directions for future research.

Related Work
The acquisition phase of a 3D reconstruction pipeline is concerned with using one or more acquisition devices to capture the data of an object (geometry, color, etc.) that will be used throughout the other phases of a 3D reconstruction process (Bernardini and Rushmeier, 2002). For the acquisition of geometry, there is a variety of hardware with considerable differences. When selecting which device to use, the acquisition devices have to be compared regarding characteristics such as precision, flexibility, reconstruction speed, portability, and operational scale (Gomes et al., 2014). Regarding low-cost acquisition technologies, there are both active (which project some sort of light upon the scene) and passive (which use image data captured by RGB cameras) capture approaches; this work focuses on the former. Some arguments for the use of active low-cost 3D acquisition techniques are: a) active sensors provide much faster captures because the depth information is calculated from one or more physical measurements instead of relying on image processing and feature matching techniques; b) active sensors are more robust regarding external factors such as changes in lighting and focal length; and c) active sensors perform better in reconstructing textureless and featureless surfaces.
A survey on low-cost 3D reconstruction of cultural heritage artifacts has already been proposed (Raimundo et al., 2018).
Nevertheless, due to limitations of low-cost depth cameras, the raw data that they provide is usually noisy, low-resolution and inaccurate (Silva et al., 2013, Cui et al., 2013). A way to improve the quality of sensor data shown in several studies (Zollhöfer et al., 2015, Silva et al., 2013) is the use of SR techniques (Park et al., 2003, Richardt et al., 2012). A comprehensive survey and a thorough taxonomy of this area have been proposed (Nasrollahi and Moeslund, 2014). Super-Resolution is the process of obtaining HR images from one or more LR observations of the same object, where one or several parameters of the imaging model (position, focal length, noise model, etc.) vary between the LR images (Nasrollahi and Moeslund, 2014). In this context, resolution can refer to one or more image characteristics such as spatial resolution or temporal resolution. With that said, like most SR approaches (Nasrollahi and Moeslund, 2014, Huang and Yang, 2010), the proposed SR method aims to improve the spatial resolution of the images, increasing the amount of high-frequency information (i.e. object detail), by varying the position of the depth camera slightly between the captures. This is different from simple image interpolation because the latter only increases the number of pixels of the input image, without seeking to reconstruct detail.
The KinectFusion system performs real-time 3D reconstruction using the Microsoft Kinect V1 as its acquisition hardware. KinectFusion showed a way to improve the quality of raw data by applying a bilateral filter (Tomasi and Manduchi, 1998) to remove noise from the input data while preserving its edges and fine detail, providing cleaner input to the other steps of the reconstruction process. Nevertheless, even after applying a bilateral filter to the depth image, holes or missing detail could be generated because of the temporal inconsistency of the Kinect. Also, given that real-time reconstruction reduces the viability of our approach and is not a requirement for heritage artifact reconstruction, this constraint is removed in the present work to allow the use of computationally intensive Super-Resolution techniques to improve the quality of the data of captured cultural artifacts as much as possible.

Proposed Approach
For each captured artifact, an acquisition protocol consisting of the number of depth captures, the angular displacement between captures, and the boundaries of the capture volume is established. Considering the strict quality requirements associated with digital heritage, it is also necessary to work directly with the sensor data instead of processed or filtered data provided by existing capture tools. Thus, an interactive tool (Fig. 1) which captures depth and color images and generates point clouds was developed to acquire the raw data from the depth sensor. This application can also be configured to capture a user-defined number of images in a single burst.
In an incremental fashion, two different techniques to solve the problems of low resolution and noise present in data from low-cost 3D scanners were developed: Smooth Accumulation and Super-Resolution, both of which leverage the burst-capture functionality of the developed tool. While Smooth Accumulation was eventually replaced by a custom SR approach, it was used in several case studies and some of its ideas were reused, such as the use of more than one depth frame to improve the acquired depth image.
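As an illustration of how a point cloud can be generated from a 16-bit depth image, the sketch below back-projects each valid pixel through a pinhole camera model. It is a minimal sketch and not the developed tool: the function name, the intrinsic parameters (fx, fy, cx, cy), the millimetre depth unit and the use of 0 as the invalid value are all assumptions made for the example.

```python
def depth_to_points(depth, fx, fy, cx, cy, invalid=0, units_per_metre=1000.0):
    """Back-project a depth image (list of rows of integers) into 3D points.

    fx, fy: focal lengths in pixels; cx, cy: principal point in pixels.
    These intrinsics are hypothetical and would come from sensor calibration.
    Pixels equal to `invalid` carry no measurement and are skipped.
    """
    points = []
    for v, row in enumerate(depth):          # v: row index (image y)
        for u, d in enumerate(row):          # u: column index (image x)
            if d == invalid:
                continue
            z = d / units_per_metre          # assumed: raw depth in millimetres
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Tiny 2x2 depth image with two invalid pixels.
cloud = depth_to_points([[0, 1000], [2000, 0]], fx=500.0, fy=500.0, cx=0.5, cy=0.5)
print(len(cloud))  # 2
```

Skipping invalid pixels at this point keeps holes in the depth image from turning into spurious vertices in the point cloud.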

Smooth Accumulation
Initial assessments of the depth data showed that even after applying a bilateral filter (Tomasi and Manduchi, 1998) to the depth map, several holes were present and part of the geometric information was missing because of the temporal inconsistency of the depth stream. To tackle this issue, data from multiple depth frames was accumulated, maintaining information that was temporarily absent due to fluctuations in the sensor measurements. Eq. (1) formalizes this accumulation technique for two grayscale input images A and B to obtain a third image (C); as a convention, subscripts are used to refer to the image element at row i and column j of an image. In practice, the resulting image C can be accumulated with the next captured frame and so on. Fig. 2 depicts the result of this process. The combination of this data accumulation technique and bilateral filtering is referred to as Smooth Accumulation.
Figure 2: Smooth accumulation: an initial attempt to solve the problems of noise and lack of detail in the acquired data. The leftmost depth image shows the captured data without accumulation, and the rightmost one depicts the data accumulated from multiple depth frames, with the added data presented in white. The depth information appears as a single shade of gray here because of the leveling necessary to make the reconstructed object visible.
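The accumulation step can be sketched as follows. This is one plausible reading of the technique rather than a transcription of Eq. (1): it assumes that missing depth is encoded as 0 and that each element of C keeps the value from A when it is valid and otherwise takes the value from B. The function names are hypothetical.

```python
def accumulate(a, b, invalid=0):
    """Combine two depth images: keep a[i][j] when valid, else use b[i][j]."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same shape")
    return [
        [pa if pa != invalid else pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


def accumulate_burst(frames, invalid=0):
    """Fold a burst of depth frames, accumulating each result with the next frame."""
    frames = list(frames)
    if not frames:
        raise ValueError("at least one frame is required")
    acc = frames[0]
    for frame in frames[1:]:
        acc = accumulate(acc, frame, invalid)
    return acc


burst = [
    [[0, 800], [0, 0]],
    [[790, 0], [0, 0]],
    [[0, 0], [805, 0]],
]
print(accumulate_burst(burst))  # [[790, 800], [805, 0]]
```

Under this assumed rule, a pixel that is missing in one frame is recovered as soon as any other frame of the burst observes it, while pixels that are never observed remain marked as invalid.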

Super-Resolution
The theoretical framework employed by most SR approaches assumes that each LR image is a warped, blurred, decimated and noisy version of the original HR image (Nasrollahi and Moeslund, 2014). Therefore, the SR problem consists of finding the correct set of transformations that turn each LR image back to its HR version and fuse their information in some way. Fig. 3 illustrates the procedure of going from several LR images to one HR image, through sequential noise removal, upsampling, deblurring, and image fusion operations.
After evaluating several third-party SR methods (Mitzel et al., 2009, Farsiu et al., 2004), a new SR approach was developed specifically to improve the quality of depth maps acquired using low-cost devices. The main motivation for developing a novel SR approach was that the results obtained through the evaluated approaches, geared towards images and videos of real-world scenes, suggested that these approaches do not correctly handle some traits of depth data (such as temporal inconsistency and gaps in the data, which are not common in real-world scenes), introducing artifacts in the final geometry.
The stages of the proposed Super-Resolution approach are established as follows: pre-processing stage, registration stage, upsampling stage, warping stage, and reconstruction stage. Although the names of these stages are somewhat novel, this approach still lies within the theoretical SR frameworks outlined by related work (Park et al., 2003, Nasrollahi and Moeslund, 2014). Notwithstanding, this naming scheme was devised to allow the reader to further differentiate the proposed approach from existing ones and drill down on the specifics of stages which are often bundled together in other studies.
Pre-processing: given the importance of using a priori information to improve quality in other low-cost 3D reconstruction studies (Raimundo et al., 2018), this stage was included in the proposed SR approach to calculate the bounding box of the 3D volume corresponding to the HR image. The information calculated during this stage is used in later steps of the SR pipeline to avoid introducing extreme values (i.e. values outside of the observed volume) and to remove non-linear noise that might be introduced when performing the subsequent image transformations.
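A simplified illustration of how such a priori bounds can be used is given below. It only handles the depth axis of the bounding box, assumes that 0 marks invalid depth, and uses hypothetical function names; the actual pre-processing stage may compute and apply the bounds differently.

```python
def depth_bounds(frames, invalid=0):
    """Smallest and largest valid depth value observed across all LR frames."""
    valid = [v for frame in frames for row in frame for v in row if v != invalid]
    if not valid:
        raise ValueError("no valid depth samples in the input frames")
    return min(valid), max(valid)


def reject_outside(image, lo, hi, invalid=0):
    """Mark as invalid every value that falls outside the observed depth range."""
    return [[v if lo <= v <= hi else invalid for v in row] for row in image]


lr_frames = [
    [[0, 900], [950, 0]],
    [[905, 0], [0, 940]],
]
lo, hi = depth_bounds(lr_frames)
print(lo, hi)  # 900 950

# A transformed image containing values that were never observed by the sensor.
print(reject_outside([[880, 925], [0, 990]], lo, hi))  # [[0, 925], [0, 0]]
```

Values produced by later transformations that lie outside the observed range are treated as invalid instead of being propagated to the fused image.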
Registration: due to the acquisition protocol and nature of the captured artifacts, only global motion between the LR frames must be compensated in this study. Again, drawing from an image processing background, a 2D registration technique (ECC image alignment (Evangelidis and Psarakis, 2008)) was used to acquire a sub-pixel registration of the LR frames to a template (usually the first frame of the sequence). Using ECC, an affine model of the translation and rotation between the frames is obtained, which can then be applied to align the images (Fig. 4, top).
The key here is that the displacement between the subsequent LR frames must be small enough to be compensated accurately via rigid 2D alignment. The current approach yields good results with an angular displacement between 1° and 3° and a linear displacement of about 1 cm. Despite that, the method is robust with regard to larger displacements, as invalid depth information generated by non-overlapping areas of the LR frames is filtered out in the next stages of the SR pipeline.
Upsampling: to take advantage of the sub-pixel alignment obtained during registration in the following stages of the SR pipeline, the LR images must be upsampled. In the proposed approach, nearest-neighbor scaling is applied to the LR frames to avoid introducing invalid depth data at this stage of the SR pipeline. It has been determined that a scaling factor of 4 works well for 16 LR frames of input; nevertheless, this constant likely has to be adjusted for other acquisition protocols and devices. The higher spatial resolution obtained from upsampling the images allows the non-redundant information present therein to be interwoven during the alignment and reconstruction stages. Fig. 4 (middle) shows how the registration obtained previously is still valid for the upsampled images, also due to the use of nearest-neighbor scaling, which preserves the features of the input data.
Warping: given the upsampled LR frames and the registration obtained previously, it is possible to perform an affine warp that brings the matching parts of the upsampled images into correspondence (Fig. 4, middle). Due to the higher pixel resolution obtained in the previous step, the sub-pixel displacements obtained via ECC in the registration phase now correspond to "whole" pixels. Thus it is possible to align these images with greater precision than before, without losing information because of aliasing. After applying the warp operations, the data from the LR frames is ready to be fused in the next step.
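The interplay between upsampling and warping can be sketched with a simplified example. Nearest-neighbor scaling only replicates existing samples, and a displacement of a quarter of an LR pixel becomes one whole pixel at a scaling factor of 4. The sketch restricts the warp to a horizontal translation for brevity, whereas the proposed approach applies a full affine warp; all function names are hypothetical and 0 is assumed to mark invalid depth.

```python
def upsample_nearest(image, factor):
    """Nearest-neighbor upsampling: each pixel becomes a factor x factor block."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [
        [v for v in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def to_hr_pixels(lr_shift, factor):
    """Convert a sub-pixel LR displacement into a displacement in HR pixels."""
    return round(lr_shift * factor)


def shift_columns(image, dx, invalid=0):
    """Translate an image by dx whole pixels along x, padding with `invalid`."""
    cols = len(image[0])
    shifted = []
    for row in image:
        new_row = [invalid] * cols
        for j, v in enumerate(row):
            k = j + dx
            if 0 <= k < cols:
                new_row[k] = v
        shifted.append(new_row)
    return shifted


lr = [[10, 20], [30, 40]]
hr = upsample_nearest(lr, 4)          # 8x8 image made only of the values 10, 20, 30, 40
dx = to_hr_pixels(0.25, 4)            # a quarter-pixel LR shift is 1 HR pixel
print(shift_columns(hr, dx)[0])       # [0, 10, 10, 10, 10, 20, 20, 20]
```

Because no new depth values are interpolated during upsampling, every value in the warped image is either an original measurement or the invalid marker, which can be filtered out during reconstruction.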
Reconstruction: following the lead of most other SR approaches, the use of mean and median filters was evaluated for the reconstruction of the HR image. These are effective and computationally efficient ways to fuse information from multiple images, which, in some cases, accurately reproduce the results obtained via complex analytical approaches (Nasrollahi and Moeslund, 2014). However, unlike what happens in general SR methods, which are geared towards regular color or grayscale images, a simple mean fusion did not yield good results for depth images due to the discontinuity of the data (Fig. 5, top left). As unreliable depth data is registered as a 0 or other invalid value by the depth sensor, simply averaging the values introduces additional noise on the depth map. A median fusion improved over these results but still left some invalid information (Fig. 5, top right), with the added disadvantage of introducing depth plateaus on the point cloud. In the final reconstruction approach, these problems were solved by eliminating zeros from the mean calculations during the image fusion, which ultimately yielded good results (Fig. 5, bottom), with very little remaining noise (an expected side effect of the mean operation) and smooth, credible geometry.
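The difference between the three fusion strategies can be seen on a single pixel observed in four aligned frames, two of which are invalid. The following is a minimal sketch under the assumption that 0 is the only invalid value; the function names are hypothetical and the rounding to integer depth values is a choice made for the example.

```python
from statistics import median


def fuse(frames, reducer):
    """Apply `reducer` to the stack of values found at each pixel position."""
    rows, cols = len(frames[0]), len(frames[0][0])
    return [
        [reducer([frame[i][j] for frame in frames]) for j in range(cols)]
        for i in range(rows)
    ]


def naive_mean(values):
    """Plain average: invalid zeros drag the result towards the sensor."""
    return round(sum(values) / len(values))


def nonzero_mean(values, invalid=0):
    """Average of the valid samples only; stays invalid if nothing was observed."""
    valid = [v for v in values if v != invalid]
    return round(sum(valid) / len(valid)) if valid else invalid


# One row, two pixels: the first is observed twice, the second never.
aligned = [[[1000, 0]], [[0, 0]], [[1004, 0]], [[0, 0]]]

print(fuse(aligned, naive_mean))    # [[501, 0]]
print(fuse(aligned, median))        # [[500.0, 0.0]]
print(fuse(aligned, nonzero_mean))  # [[1002, 0]]
```

In this example both the plain mean and the median return a depth of about half the measured value, a point floating far in front of the surface, while the mean restricted to valid samples stays close to the two actual measurements.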

Results
One of the main challenges faced by low-cost 3D reconstruction pipelines is the low resolution, lack of detail, and high amount of noise of the depth data provided by the hardware (Cui et al., 2013). Therefore, considerable effort was dedicated to improving the results of the acquisition phase, because if better geometric data is passed on to the next stages of a low-cost 3D reconstruction pipeline, better 3D models are produced. In what follows, the results of applying the developed SR method to captured heritage artifacts are presented and discussed.

Experimental Setup
The main execution environment of the software developed within this work was a laptop computer with a 2.5 GHz Intel Core i7 6500-U processor, 8 GB of RAM and a discrete Geforce 940MX graphics card. The acquisition devices were a Microsoft Kinect V1 scanner and a turntable. An overview of the experimental setup is provided in Fig. 6. Software tools were developed from third-party libraries for various purposes. OpenCV (Bradski and Kaehler, 2000) and PCL (Rusu and Cousins, 2011) were used to handle and process images and point clouds, OpenGL (Woo et al., 1999) was the graphics library used for the visualization module, and the libfreenect driver was used in the depth acquisition application to access and retrieve data from the Microsoft Kinect V1.
Two heritage artifacts were captured: a pot with fish-like carvings (hereafter referred to as "Fish Pot", Fig. 7) and a turtle-shaped clay pan ("Turtle Pan", Fig. 8). Both pieces belong to the MAE/UFBA collection and are part of the material culture of the Waujá indigenous tribe. They were chosen for this analysis due to their differences in geometry, finishing and motifs.

Experimental results and discussion
The objective of the proposed SR pipeline was to improve over the smooth accumulation approach whilst still mitigating the two main traits of depth maps acquired using low-cost RGB-D hardware: heavy noise and low amount of detail. Fig. 9 illustrates the results of using the proposed SR pipeline on the Turtle Pan. Compared to the raw depth map acquired from a Microsoft Kinect V1, more data is present in the capture (≈ 12,000 vs. ≈ 10,000 captured points), several holes were filled, and the geometry is overall smoother.
Complementing what was already shown in Fig. 9, Fig. 10 shows further results of this approach depicting the geometry acquired from the Turtle Pan. The smoothing capabilities of the SR technique are evident in the image, as the plateaus in the data are much less noticeable; however, it is also important that details have not been lost in the operation and actually became more distinguishable. While in Fig. 10 (left) the shape of the object appears somewhat flattened, the point cloud in Fig. 10 (right) approximates the surface of the object more accurately, and even presents a piece of the object which had not been captured previously.
Fig. 11 shows the SR results for the Fish Pot. In this case, the smoothing properties of the technique are still present, but the acquisition of additional geometry and reconstruction of details is more apparent. The overall shape and features of the scanned artifact are at the same time smoother and more well-defined. The original point cloud (without SR) also presents some distortion, which flattens the geometry of the object in the same way that happened to the Turtle Pan, while the SR version of the point cloud more closely approximates the round shape of the object.
Table 1 quantitatively presents the results of using the proposed SR technique on other point clouds of tested heritage pieces; the increase in captured vertices ranges from 2.97% to 38.44%. These results confirm that more vertices are obtained using SR than without. Together with a qualitative evaluation of the final reconstructions, this indicates that reconstruction of detail is obtained via the proposed method, while a reduction in number of vertices, if any, would indicate the removal of noise or invalid geometry.
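For reference, a relative increase in captured vertices can be computed as in the short sketch below. The function name is hypothetical, and the example uses the approximate Turtle Pan point counts quoted above rather than the entries of Table 1.

```python
def vertex_increase_percent(before, after):
    """Relative change in vertex count, in percent, from `before` to `after`."""
    if before <= 0:
        raise ValueError("the reference vertex count must be positive")
    return 100.0 * (after - before) / before


# Approximate counts for the Turtle Pan depth map (raw vs. SR).
print(vertex_increase_percent(10_000, 12_000))  # 20.0
```

A negative result would correspond to a net removal of vertices, which the analysis above associates with the elimination of noise or invalid geometry.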

Conclusions
The depth data obtained from low-cost sensors is usually lacking in detail and consistency, which affects the quality of the models obtained in 3D reconstruction. Through the proposed Super-Resolution approach, it has been possible to enhance this data and reconstruct detail beyond what the sensor initially provides, as indicated by our experimental results. Moreover, the proposed method also improves the overall quality of the data in terms of smoothness and presence of holes.
Future research should focus on an extended evaluation of the current approach through both the reconstruction of more heritage artifacts and a quantitative analysis of the final 3D reconstructions, which entails the existence of some baseline ground truth. Another interesting goal for future work is the deployment of a complete 3D reconstruction pipeline for the preservation of heritage artifacts by museum staff. Such a pipeline should take the limitations of low-cost hardware and existing heritage practice into consideration to be as streamlined as possible.
Figure 10: Acquired point cloud before (left) and after (right) SR processing (Turtle Pan). Upon closer inspection, it is possible to see that the plateaus in the data are thoroughly smoothed by the SR approach, without sacrificing detail.
Figure 11: Acquired point cloud before (left) and after (right) SR processing (Fish Pot). The smoothing properties of the SR approach are once again visible, but in this case the reconstruction of additional geometry is clearer.