The Reconnection of Contour Lines from Scanned Color Images of Topographical Maps Based on GPU Implementation

Jianfeng Song, Panfeng Wang, Qiguang Miao, Ruyi Liu, and Bormin Huang

Abstract—This paper presents a method for the reconnection of contour lines from scanned color images of topographical maps based on a graphics processing unit (GPU) implementation. The extraction of contour lines, which are shown in brown on USGS maps, is a difficult process due to aliasing and false colors induced by the scanning process and due to closely spaced and intersecting/overlapping features inherent to the map. First, an effective CPU-based method is presented for contour line reconnection from scanned topographical maps. This method considers both the distance and the direction between the two broken end points of the contour lines. It performs well and has a high connection rate, but the time complexity of the algorithm grows nonlinearly with the size of the topographical map. Second, advantage is taken of the massively parallel computing capability of the GPU with the compute unified device architecture to improve the algorithm. Finally, better performance is achieved with the aid of the open source computer vision library. The experimental results show that the GPU implementation with loop-based patterns achieves a speedup of 1360× and a result identical to that of the CPU implementation.

Index Terms—Compute unified device architecture (CUDA), contour lines, graphics processing unit (GPU), open source computer vision (OpenCV).


I. INTRODUCTION

Graphics processing units (GPUs) were originally designed for computer graphics rendering. They are now widely used in scientific computing and image processing, and they are especially well suited to data-parallel computation. The popular distributed GPU computing model has promoted the

Manuscript received January 12, 2016; revised May 23, 2016; accepted June 9, 2016. Date of publication June 29, 2016; date of current version January 23, 2017. The work was supported in part by the National Natural Science Foundations of China under Grant 61472302, Grant 61272280, Grant 41271447, and Grant U1404620, in part by the Program for New Century Excellent Talents in University under Grant NCET-12-0919, in part by the Fundamental Research Funds for the Central Universities under Grant K5051203020, Grant K5051303018, Grant JB150313, and Grant BDY081422, in part by the Natural Science Foundation of Shaanxi Province under Grant 2014JM8310 and Grant 2010JM8027, in part by the Creative Project of the Science and Technology State of Xi'an under Grant CXY1441(1), and in part by the State Key Laboratory of Geo-information Engineering under Grant SKLGIE2014-M-4-4-4. (Corresponding author: Qiguang Miao.)

The authors are with the School of Computer and Technology, Xidian University, Xi'an 710071, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at

Digital Object Identifier 10.1109/JSTARS.2016.2580903

development of the top supercomputers [1]. Only one thread can run on a CPU core at any moment, whereas a GPU uses lightweight threads that are managed by the hardware and can be switched at low cost. Within the hardware limits, the more compute unified device architecture (CUDA) threads there are, the better the computing performance. A GPU has hundreds of parallel processing units, and its memory bandwidth is several times that of a CPU; in terms of raw computing capacity, GPUs have far exceeded CPUs. GPU parallel computing has become a key technology for accelerating the numerical simulation of science and engineering applications.

A contour line of a function of two variables is a curve along which the function has a constant value. Contour lines are the main graphical element used to characterize three-dimensional (3-D) terrain on 2-D maps. Color image segmentation is a general method used to separate the desirable features (contour lines) from undesirable features (other information on the maps, such as streams, characters, etc.). But in the pretreatment of topographical map color segmentation [2]–[7], including background removal [8], [9] and denoising, situations such as color confusion and overlapping can cause contour gaps. The contour gaps have three causes. First, the scanned color topographical map information is incomplete, and contour line segments are lost with great randomness. Second, contour line segments are covered by other color information such as black text and the kilometer grid. Third, some supplementary contour lines on the topographical map are drawn with short dashed lines. The existing contour line connection algorithms fall into the following categories. Xueliang et al. [10] use a maximum clique graph searching method to reconnect contour lines. The algorithm is a graph-theoretic searching method, and its error rate is relatively low. The method has automatic error correction capability, but it needs manual participation and is time consuming. Ping [11] uses a Freeman chain code to reconnect contour lines. This method computes the code of a line extended pixel by pixel beyond each broken end point of the contour lines; the connection is made where the extensions of two broken lines intersect. The method performs better on scanned color images of topographical maps that have thicker features spaced far apart from each other. However, the algorithm is less effective, and the error rate may rise, where the contour lines are dense. Yi [12] develops a method to reconnect contour lines based on prior knowledge.
This method takes different ways to search and match broken points of the

1939-1404 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See standards/publications/rights/index.html for more information.


Fig. 1. Preprocessing results of contour line reconnection. (a) Resulting image with background of topographic map removed. (b) Segmented and denoised resulting image. (c) Resulting image with binarization. (d) Thinned and pruned resulting image.

contour lines according to specific circumstances such as gulches, escarpments, and text labels. It avoids the blindness of conventional methods, but it is very difficult to integrate all the different algorithms for connecting the broken lines. Another approach, combining minimum distance and direction difference, is proposed by Li [13]. This method considers both the distance and the direction between two broken end points of the contour lines. It performs well and has a high connection rate, but its computational cost is high and cannot meet the requirements of practical applications. Our prior method could not satisfy the performance requirements, especially the time performance, of practical engineering. We improve the algorithm combining minimum distance and direction difference and solve the time-performance problem of contour line reconnection with an NVIDIA GPU. Moreover, the GPU implementation can be used in practical engineering applications. The rest of the paper is organized as follows. Section II describes the algorithm design of contour connection. Section III explains the GPU implementation of contour connection. Section IV concludes the paper.


II. ALGORITHM DESIGN OF CONTOUR CONNECTION

Color image segmentation is a general method used to separate the desirable features (contour lines) from undesirable features (other information on the maps, such as streams, characters, etc.). The background of the topographical map influences the extraction process because the color of the contour lines changes relative to their background. First, we remove the background of the topographical map with a linear feature separation method [14]. Second, we use the Gustafson–Kessel clustering algorithm to extract the contour lines. Third, after extracting the brown contour lines from the scanned topographical map, preprocessing is a significant step. The preprocessing of the brown contour lines involves several steps, including denoising [15]–[17], binarization [18]–[24], thinning [25], and pruning. A simple example illustrating the preprocessing for contour line reconnection is shown in Fig. 1.

Fig. 2 shows the eight neighborhood pixels (P1, P2, P3, P4, P5, P6, P7, P8) of a central pixel P0.

Set foreground pixels to 0 and background pixels to 1.

Fig. 2. Eight neighborhood regions of the central pixel P0.

Fig. 3. Break contours in a thinned image.

P0 is a break point in a thinned image if

P0 = 0 and P1 + P2 + P3 + P4 + P5 + P6 + P7 + P8 = 7.   (1)

Since foreground pixels are coded 0, an end point of a one-pixel-wide curve has exactly one foreground neighbor, so its eight neighbors sum to 7.


P0 is a contour intersection in a thinned image if

P0 = 0 and P1 + P2 + P3 + P4 + P5 + P6 + P7 + P8 < 6,   (2)

i.e., the pixel has at least three foreground neighbors.
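As a sketch (ours, not the authors' code), the two tests above reduce to a sum over the eight neighbors. The function names are our own, and the sketch assumes, as stated above, that foreground pixels are 0, background pixels are 1, and an end point of a thinned curve has exactly one foreground neighbor:

```c
#include <assert.h>

/* Neighbors n[0..7] hold P1..P8 of the central pixel p0.
 * Foreground pixels are 0 and background pixels are 1. */
static int neighbor_sum(const int n[8]) {
    int s = 0;
    for (int i = 0; i < 8; ++i) s += n[i];
    return s;
}

/* Condition (1): an end point of a one-pixel-wide curve has exactly
 * one foreground (0) neighbor, so the eight neighbors sum to 7. */
int is_break_point(int p0, const int n[8]) {
    return p0 == 0 && neighbor_sum(n) == 7;
}

/* Condition (2): an intersection has at least three foreground
 * neighbors, so the eight neighbors sum to less than 6. */
int is_intersection(int p0, const int n[8]) {
    return p0 == 0 && neighbor_sum(n) < 6;
}
```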


The input to the contour connection method is a thinned image. First, all break points in the map are found. Then, for each break point, we search for another break point with a similar direction and the shortest distance. Finally, a connection function is used to connect the two break points. In topographical maps, most pairs of break points are in the opposite or parabola direction. As shown in Fig. 3, the colored lines are broken contours: the red broken contours are in the parabola direction, and the blue broken contours are in the opposite direction.

The position of the break point can be obtained by searching the thinned image. The match condition between the two break points is as follows.

A. Direction Condition

Fig. 4 shows the eight directions of a central break point whose coordinates are (i, j). The opposite-direction pairs are (d0, d4) and (d2, d6). The parabola-direction pairs are (d0, d2), (d2, d4), (d4, d6), and (d6, d0). The direction of any break point is obtained by backtracking, and a break point is added to the break point set when its direction is appropriate to the central break point.

B. Distance Condition and Angle Condition

The distance condition limits the Euclidean distance between two break points.

The distance is calculated as follows:

Ds(xPa, yPa, xPb, yPb) = √((xPa − xPb)² + (yPa − yPb)²)   (3)


Fig. 4. Eight directions of the central break point.

where Pa and Pb are two different points, xPa and xPb are the x-coordinates of Pa and Pb, and yPa and yPb are their y-coordinates. The distance condition is Ds < δ1, where δ1 is a small constant.

The angle condition limits the curve tendency of the contour lines. The angle of two different points is defined as

ϑ(xPa, yPa, xPb, yPb) = arctan((yPa − yPb)/(xPa − xPb))   (4)

where Pa and Pb are two different points, xPa is the x-coordinate of Pa, xPb is the x-coordinate of Pb, yPa is the y-coordinate of Pa, and yPb is the y-coordinate of Pb.

The angle of contour line is

ϑ(Pa, L) = average(ϑ(Pa, Pi)),  i = a + 1, a + 2, . . . , a + k   (5)

where Pa is a break point of the contour line L, Pi is a point on the contour line, and there are ten pixels between consecutive sample points Pi and Pi+1. The value of the factor k can be determined empirically.

If there are two contour lines L0 and L1, let P0 be the break point of L0 and P1 the break point of L1. We calculate ϑ(P0, L0), ϑ(P1, L1), and ϑ(P0, P1), where P1 lies in P0's search window region. The angle condition is as follows:

max{|ϑ(P0, L0) − ϑ(P0, P1)|, |ϑ(P1, L1) − ϑ(P0, P1)|} < δ2   (6)

where δ2 is a small constant.

In this paper, we combine the distance condition and the angle condition and finally calculate the function value φ(P0, P1, L0, L1) as follows:

φ(P0, P1, L0, L1) = max{|ϑ(P0, L0) − ϑ(P0, P1)|, |ϑ(P1, L1) − ϑ(P0, P1)|} × φ1 + Ds(P0, P1) × φ2   (7)

where φ1 is the angle weight and φ2 is the distance weight; both parameters can be determined empirically.
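With the angles of (4) and (5) and the distance of (3) precomputed, the angle condition (6) and the matching cost (7) are only a few lines of C. This is a sketch of ours; φ1 = 0.35 and φ2 = 0.65 are the weights used later in the experiments:

```c
#include <assert.h>
#include <math.h>

/* Left-hand side of the angle condition, Eq. (6). */
double angle_term(double th0_l0, double th1_l1, double th01) {
    double a = fabs(th0_l0 - th01);
    double b = fabs(th1_l1 - th01);
    return a > b ? a : b;
}

/* Matching cost of Eq. (7): weighted angle difference plus weighted
 * distance; the pair minimizing this cost is chosen in Eq. (8). */
double match_cost(double th0_l0, double th1_l1, double th01,
                  double ds, double phi1, double phi2) {
    return angle_term(th0_l0, th1_l1, th01) * phi1 + ds * phi2;
}
```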

(P1*, L1*) = arg min{φ(P0, Pi, L0, Li)}.   (8)

Fig. 5. C code to determine an intersection point.

Fig. 6. Process of contour connection.

P1* is considered to be the optimal match to P0, and the two break points belong to the same contour line. The contour lines L0 and L1* should be connected.

C. Crossing Line Condition

Contour lines never intersect each other, and they form closed loops in topographical maps. The reconnection will be canceled if there is an intersection point on the new line that the two break points constitute. Fig. 5 shows the C code that determines whether there is an intersection point. According to the above analysis, the process of contour line reconnection can be expressed as the following steps; the process of contour connection in our study is depicted in Fig. 6.
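Fig. 5 itself is not reproduced here, but one common way to implement such a crossing check — an assumption of ours, not necessarily the authors' exact routine — is to walk the candidate segment with Bresenham's line algorithm and reject the connection as soon as the walk lands on a pixel that already belongs to a line:

```c
#include <assert.h>
#include <stdlib.h>

/* Walk the segment (x0,y0)-(x1,y1) with Bresenham's algorithm and
 * report whether it crosses an existing foreground pixel (value 0).
 * img is row-major with w columns; the two endpoints themselves are
 * skipped because they are the break points being joined. */
int segment_crosses(const int *img, int w,
                    int x0, int y0, int x1, int y1) {
    int dx = abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;
    while (!(x0 == x1 && y0 == y1)) {
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }
        if (e2 <= dx) { err += dx; y0 += sy; }
        if (x0 == x1 && y0 == y1) break;      /* skip the far endpoint */
        if (img[y0 * w + x0] == 0) return 1;  /* hits an existing line */
    }
    return 0;
}
```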

Step 1: Set the initial value hstart, the window threshold hmax, the step size hadd, the angle weight φ1, and the distance weight φ2. Set the candidate point set S empty and track the curve to the first break point P0.

Step 2: In the region of size h × h centered at P0, search for the other break points P1, P2, P3, ..., Pn and add them to the candidate point set S.

Step 3: Compare the direction of P0 with that of every other point Pi in S. Delete the break points that are not in the opposite or


Fig. 7. CUDA threads hierarchy.

parabola direction relative to P0. Meanwhile, delete the break points that are on the same contour line as P0.

Step 4: Delete from S those break points that do not satisfy the distance and angle constraints. If S is empty, then jump to Step 7.

Step 5: Find the minimum of the function φ(P0, Pi, L0, Li), obtain the break point Pbest, and connect P0 and Pbest.

Step 6: If there is no cross point in this region, then the connection is done. Otherwise, cancel the connection and jump to Step 7.

Step 7: If h < hmax, then let h = h + hadd and jump to Step 2. Otherwise, the iteration is over.
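The control flow of Steps 1–7 is an expanding-window search. The sketch below (ours, with a toy stand-in for Steps 2–6) shows only how the window h grows by hadd until either a match is accepted or h exceeds hmax:

```c
#include <assert.h>

/* Toy stand-in for Steps 2-6: "a valid candidate appears once the
 * window reaches size `need`". A real implementation would search,
 * filter, match, and cross-check break points here. */
typedef struct { int need; } ToyCurve;

static int find_candidate(const ToyCurve *c, int h) {
    return h >= c->need;
}

/* Step 7 loop: grow h from hstart by hadd up to hmax; return the
 * window size at which a connection was made, or -1 if none. */
int search_window(const ToyCurve *c, int hstart, int hadd, int hmax) {
    for (int h = hstart; h <= hmax; h += hadd)
        if (find_candidate(c, h))
            return h;
    return -1;
}
```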

The efficiency of this program is very low: it takes more than 1 h to reconnect a 6568 × 8009 thinned image. The algorithm combining minimum distance and direction difference has the following characteristics. First, the image data are parallel. Data parallelism is a property of the application program: calculation operations can safely be performed on certain data structures at the same time. Second, the problem can be partitioned into subproblems, so a multithreaded implementation can improve performance. Third, the algorithm is a local connection method, and the loop over the whole image can be replaced by CUDA thread and block indices.


III. GPU IMPLEMENTATION OF CONTOUR CONNECTION

A. CUDA C Programming Method

CUDA is a complete general-purpose parallel computing solution launched by NVIDIA, designed as a parallel computing hardware and software architecture [26]. CUDA is widely used in numerical matrix computing, video and image processing, machine learning, etc. The GPU can be regarded as a device that carries out a very large number of parallel threads when a CUDA application is compiled [27]. Tens of thousands

Fig. 8. Schematic visualization of a multiprocessor.

of threads can be used to accelerate a CUDA application, which can improve program execution performance dramatically. The CUDA C programming language is a simple extension of the standard C programming language, and it is very convenient for developers to access memory data with pointer operations in kernel functions. GPU applications can be developed in a grammar quite close to standard C. The general CUDA C programming flow is a multistep process. Initially, the parallel part of the algorithm, including the parallel data and processing, is separated from the whole algorithm; then the host memory is copied to the device memory before the kernel function is called. A kernel launch must specify the number of thread blocks and the number of threads per block. Finally, the device memory data are copied back to the host memory, and the GPU computing result is obtained. The CUDA code thus consists of three computational phases: transfer of data into the global memory of the GPU, execution of the CUDA kernel, and transfer of results from the GPU back into CPU memory. The CUDA thread hierarchy is shown in Fig. 7. CUDA takes a bottom-up view of parallelism in which a thread is the atomic unit of parallelism. Threads are organized into a three-level hierarchy. The highest level is a grid, which consists of thread blocks; a grid is a 3-D array of thread blocks. Thread blocks implement coarse-grained scalable data parallelism, and they are executed


independently, which allows them to be scheduled in any order across any number of cores. This allows the CUDA code to scale with the number of processors. A kernel can be executed by multiple equally shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks. The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors in the system, which it can greatly exceed.
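The grid sizing described above is the usual ceiling division; the helper names below are ours:

```c
#include <assert.h>

/* Number of equally shaped blocks needed to cover n elements:
 * ceiling division, so blocks * block_size >= n. */
int num_blocks(int n, int block_size) {
    return (n + block_size - 1) / block_size;
}

/* Total threads launched; this may slightly exceed n, which is why
 * kernels guard against out-of-range indices. */
int total_threads(int n, int block_size) {
    return num_blocks(n, block_size) * block_size;
}
```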

B. GPU Implementation Based on Divide-and-Conquer Patterns

The massive parallelism of the GPU hardware is achieved by a set of identical stream multiprocessors. Each stream multiprocessor independently schedules its internal resources and execution units to deal with its assigned threads. The NVIDIA GeForce GTX 750 GPU has four stream multiprocessors. Fig. 8 presents a schematic visualization of a multiprocessor. Each multiprocessor has 32 CUDA cores and executes in parallel with the other multiprocessors. All 32 cores in a group execute in a single-instruction multiple-data fashion, with all cores in the same multiprocessor executing the same instruction at the same time. There are eight special function units (SFUs) inside each multiprocessor for single-precision floating-point transcendental functions. Each multiprocessor has eight load (LD)/store (ST) units for memory access. Each multiprocessor also has 32768 32-bit registers, which can be accessed in a single clock cycle. The GTX 750 has 1 GB of global memory, which has a higher bandwidth than the DRAM memory attached to the CPU. In massively parallel computing, divide and conquer is the most common and most effective strategy. The programmer partitions the problem into coarse subproblems that can be solved independently in parallel by blocks of threads, and each subproblem into finer pieces that can be solved cooperatively in parallel by all threads within a block. Each thread is given a unique global index in the i and j domains using thread and block indices; the result of this thread mapping is that multiple threads carry out the same computation on different image data in parallel. We use the open source computer vision (OpenCV) code library to read the image data from the local disk and allocate memory of the appropriate size on the GPU device through the CUDA memory management functions. The image data are copied from the host memory to the device memory, and then the kernel function is called to reconnect the contour lines.
The method and the constant parameters used to search and match break points remain the same as in the CPU method. After all the threads exit normally, the CUDA memory management functions are called to copy the result image back to the host memory. The number of threads per block is 64. The size of the grid is determined according to the size of the image and the block size. For a fair comparison, we ran both our GPU code and our CPU code on an HP Z600 workstation (Intel(R) Xeon(R) CPU E5520 @ 2.27 GHz) equipped with an NVIDIA GTX 750 GPU, with thread block size = 64, image block = 120 × 120, hmax = 50, hadd = 5, φ1 = 0.35, φ2 = 0.65, and no I/O transfer. As seen from the speedup results in Table I, our GPU implementation is faster than the CPU implementation because we make full use of the parallel programming approach. We achieve a speedup of 251.2× without



TABLE I
CPU AND GPU RUNTIMES

CPU runtime (s)   GPU runtime (s)   Speedup
4020              16                251.2×

Fig. 9. Image A (1293 × 705).

Fig. 10. Image B (1904 × 973).

any program or algorithm optimization. We use the connection rate to measure the accuracy of the algorithm results. The connection rate is defined as

w = F/G (9)

where F is the number of connected break points and G is the total number of break points. Figs. 9–12 show example images of different sizes.

Fig. 13 shows the connection rate for different images. As seen from Fig. 13, in general, the larger the image block, the better the connection results. The main reason is that break points in different blocks cannot be connected in the parallel process. But there is an exception when the image block size is 90 × 90 and the image size is 1293 × 705 (Image A). The kilometer grid in this image causes a large number of break points, and the image block boundary lies just near the kilometer grid, which leads to unconnected break points. Another problem is that we cannot set one fixed block size that gives the best result for every image. In short, the result based on divide and conquer in parallel programming cannot always be the same as that of the serial CPU program.


Fig. 11. Image C (5000 × 3810).

Fig. 12. Image D (6568 × 8009).

Fig. 13. Connection rate of different images in different size.

C. Further Improvements With Loop-Based Patterns

If the number of threads we schedule equals the number of pixels in the image, computing performance improves because, in general, more threads can hide the memory-read latency of a single thread. Loop-based patterns are used to expand the loop in the program. In order to improve the image processing result, the GPU is used to search for and record the matched break points, while the cross judgments and the reconnection process are executed on the CPU. We separate the image writing operations from the kernel function. The loops over spatial grid points (i, j) are

Fig. 14. Contrast of connection rate between divide-and-conquer pattern and loop-based patterns.

Fig. 15. Contrast of speedup between divide-and-conquer pattern and loop-based patterns.

replaced by index computations using thread and block indices:

i = blockIdx.x * blockDim.x + threadIdx.x;
j = blockIdx.y * blockDim.y + threadIdx.y;
dim3 dimGrid((h + TI - 1)/TI, (w + TI - 1)/TI);
dim3 dimBlock(TI, TI);

where h and w are the height and width of the image and TI is the dimension of the thread block. One thread handles one image pixel, and part of the computing tasks are assigned to the CPU. Fig. 14 contrasts the connection rate of the divide-and-conquer pattern with that of the loop-based pattern. The image block is 120 × 120. Loop-based patterns are useful: the result image is the same as the one produced by the CPU program. A large number of parallel threads improves the computing performance, as Fig. 15 shows. As Fig. 15 also shows, the acceleration of the GPU is not very obvious when the image size is small. The GPU provides great acceleration for larger images and achieves higher computation performance with the loop-based pattern. The CPU takes about 67 min, but the improved algorithm on the GPU reduces the computation time to 13 s and achieves a speedup of 309.2×.


Fig. 16. Speedup of the CUDA C codes versus block size.

D. Impact of Block Size on GPU-Based Contour Connection

The efficiency of the CUDA C program can be further improved by adjusting the thread block size. When the block size is a multiple of 32 threads (i.e., a warp), the computing performance is usually better than at neighboring block sizes, since threads are grouped 32 at a time and the multiprocessor issues the same instruction to all the threads in a warp to make execution more efficient, which is one of the merits of the GPU architecture. But a large block size may cause excessive memory requirements, which lowers the efficiency of the stream processors. The effect of the thread block size on the computing performance was evaluated by varying the block size while keeping the registers per thread at 43. Fig. 16 shows the speedup of the CUDA C codes versus the block size. It was found that a block size of 64 produced the best performance; this is what we used in the previous sections.

E. Impact of Registers/Thread on GPU-Based Contour Connection

If we keep as many variables as possible in registers, the computation performance is greatly improved, because global memory access takes much more time than register access. There are thousands of registers in each stream multiprocessor. The compiler determines the number of registers per thread when the CUDA C program is compiled. If the number of registers per thread is too large, the number of thread blocks that can be scheduled on each stream multiprocessor is limited. If the number of threads is too small, the GPU hardware resources cannot be fully used and the performance decreases. Fig. 17 shows that the best computation performance occurs at 43 registers per thread.

F. Data Transfer Between CPU and GPU

This experiment supports the general finding that data transfer takes up a lot of time and is an important factor in the speedup. When I/O transfer is considered, overlapping CUDA kernel execution with asynchronous memory access for data transfer can be applied. Each NVIDIA Tesla

Fig. 17. Speedup versus registers per thread.

Fig. 18. Overlapping memory transfer with host-to-device memory transfer, GPU kernel execution, and device-to-host memory transfer.


TABLE II
GPU RUNNING TIME WITH AND WITHOUT OVERLAPPING

                               Total runtime   Host to device   Kernel execution   Device to host
CPU                            4 020 000 ms    −                −                  −
GPU (with nonoverlapping)      11 764 ms       16 ms            11 716 ms          32 ms
GPU (with overlapping)         11 706 ms       −                −                  −
Speedup (with nonoverlapping)  341.7×          −                −                  −
Speedup (with overlapping)     343.4×          −                −                  −

K40 has one copy engine and one kernel engine, allowing overlapped data transfer and kernel execution. We chose the NVIDIA Tesla K40 with 2880 cores as the GPU device. A diagram explaining the computation timeline of the GPU implementation of contour line reconnection is depicted in Fig. 18. As commands pipeline, the streams, illustrated in different colors, execute commands in first-in-first-out order. As a result, on GPUs with two copy engines, the same stream can arrange the overlap between CPU-to-GPU and GPU-to-CPU memory transfers and kernel execution. Table II lists the GPU running time using thread block size = 64, registers per thread = 43, image size = 6568 × 8009, hmax = 50, hadd = 5, φ1 = 0.35,



TABLE III
SPEEDUP ON DIFFERENT GPU DEVICES

GPU device                  CUDA cores   Speedup
NVIDIA GeForce GTX 750      512          343.4×
NVIDIA GeForce GTX 650 Ti   768          510.8×
NVIDIA GeForce GTX 760      1152         816.5×
NVIDIA Tesla K40            2880         1360.9×

Fig. 19. Resulting image of contour lines reconnection.

φ2 = 0.65. When overlapping access is taken into account, the computing performance results listed in Table II show that overlapping access achieves better performance than nonoverlapping access. Different GPU devices obtain different computation performances; Table III lists the experimental results with thread block size = 64, registers per thread = 43, image size = 6568 × 8009, hmax = 50, hadd = 5, φ1 = 0.35, φ2 = 0.65. As Table III shows, we achieved a speedup of 1360.9× and the same connection rate as the CPU implementation. Fig. 19 shows the result of contour line reconnection, in which the pink lines are the reconnected lines.


IV. CONCLUSION

Contour lines are the main graphical element used to characterize 3-D terrain on 2-D maps. But in the pretreatment of topographical map color segmentation, including background removal and denoising, situations such as color confusion and overlapping can cause contour gaps. Contour lines often overlap or intersect with other information on topographical maps. It is difficult to clean up all the other unwanted information while retaining the …