QUALITY MESH GENERATION ON GPU
Zoom Presentation
https://nus-sg.zoom.us/j/92703632363?pwd=c1dnVno2NzZSbUQ4L2ppT2J2WVF6Zz09
password: chenzh0403
Abstract:
Mesh generation is one important computation step in many engineering and scientic applications, such as finite element, interpolation, GIS, path planning, etc. The input to mesh generation is usually a set of points P, a set of line segments L where each is formed with endpoints in P, and a set of faces where each has boundary defined with line segments in L. These sets are abstraction of important features of the domain to generate an output mesh conforming to them and with augmentation of new points, called Steiner points, for subsequent processing. In general, output has to meet some quality measures, such as no small angles in triangles of the mesh, the aspect ratios of the triangles or tetrahedra of the mesh are to be bounded etc. And, it is desirable not to use excessively many Steiner points, as a large output mesh means, in general, a long computation time needed for subsequent processing.
There are good algorithms and software developed in the past 30 years to address mesh generation in 2D and 3D, notably, the Triangle software by Shewchuk, the TetGen software by Hang, and the CGAL software library. In 2D, the input is termed planar straight line graph (PSLG), while in 3D, piecewise linear complex (PLC). These algorithms usually add Steiner point one by one to refine a mesh until a quality one is reached. However, the increase in the input size can lead such software to take hours to complete, even with the most powerful PC available to date. This is undesirable to many, especially interactive, applications. Thus, we were curious what may be possibly achieved with the inexpensive, parallel computing power of GPU, capable of launching thousands of concurrent threads, when applied to refining meshes.
Our review of prior work shows that there was no particularly successful attempt in adapting GPU to compute quality meshes. Though there are reports on such works, for the past 15 years of flourishing development in GPU, none have provided convincing evidence of correct answer nor good performance for mesh generation with PSLG or PLC. We thus started exploring GPU for mesh generation a few years ago. It was clear from the early days of this research that a good GPU algorithm would not be a direct porting from a CPU algorithm. In particular, it is not obvious how to do waste management that is important in exploiting the potential of GPU hardware. At a glance, one may attempt to refine a mesh by simply inserting Steiner points concurrently with thousands of GPU threads to gain good speedup. This, however, is problematic as independent, concurrent and uncoordinated insertions can lead to redundant ones and also cast serious doubt on whether the approach can ever terminate in finnite time.
With the above review and challenges in mind, we began to explore how to design and implement a performant mesh generator on the GPU. Along the way, we found out that the usual practices of regularized work and localized data are not enough to achieve a performant GPU algorithm because of the existence of the waste, i.e., computations that are not useful. Therefore, the management of waste is important for a GPU program to really get a good speedup, but it has not been well addressed. This is probably the reason why there have been no working GPU mesh generators so far. To this end, we discovered and summarized a set of high-level rules for the waste management in GPU when running computational programs. They cater to expected scenarios based on workload for a single kernel as well as special cases of workloads of a serial of kernels within a small window. The rules are applied to GPU algorithms for 2D and 3D Delaunay mesh refinement described as follows.
We started the research with an in-depth study on the 2D constrained Delaunay refinement problem. We learn that the Triangle and CGAL are the fastest CPU Delaunay mesh generators that produce quality mesh with no angle smaller than an input angle θ. The former partially unifies the algorithms of Ruppert and of Chew, whereas the latter implements Ruppert's algorithm. In Ruppert's algorithm, one needs to compute all the segments that are encroached by a point (i.e., the point is inside the circumcircles of the line segments), while in Chew's algorithm, one needs to search for all points encroaching a segment. Both types of encroachment are inefficient when done naively with GPU threads due to unbalance workload among them. Guided by the rules, our efforts have found a way to avoid unbalance workload by resolving all encroachments indirectly through Delaunay property coupled with a so-called Flip-Flop algorithm. Our implementation is shown to run from a few times to an order of magnitude faster than the Triangle and CGAL software. It is also proven to terminate with finite output size for an input PSLG with no angle smaller than 60 degrees and theta less than or equal to 20.7 degrees. In addition, we notice that meshes generated by our algorithm are of similar sizes to those by Triangle and CGAL.
With the good result on the 2D constrained Delaunay refinement on the GPU, we moved on to examine the 3D constrained Delaunay refinement problem. In 3D, we learned that the so-called constrained Delaunay Triangulation (CDT) of a given PLC may not exist unless all segments of the PLC are strongly Delaunay. One can do so with Steiner points added to line segments to generate a CDT efficiently. The problem thus becomes how fast one can generate a quality CDT from a given CDT. Our review of prior work shows that TetGen is the best, but can still take a long time to complete its computation. This provided good justification to once again explore the use of GPU. The obvious challenge is to realize the insertion of multiple Steiner points concurrently and efficiently. However, the useful mechanism of the Flip-Flop algorithm in 2D is no longer workable in 3D as flipping is not known to reach all possible triangulations. With the rules, our effort has found a way to utilize thousands of GPU threads to concurrently grow and shrink so-called cavities of varying sizes while taking good care of workload balancing to then add multiple Steiner points to realize a good speedup. Its implementation shows that it can be an order of magnitude faster than TetGen while using similar quantities of Steiner points to produce triangulations of comparable qualities and with guarantee of termination.
With the successes and experiences gained from the study of the above two problems in 2D and 3D, we continued to explore the 3D restricted Delaunay refinement problem. The output mesh for this problem is not strictly conforming to the input PLC. Rather, it is a Delaunay triangulation that approximates the input domain boundary by a new boundary triangulation using the notion of a restricted Delaunay triangulation. This means that all polygons in the input PLC are approximated by the union of triangular facets whose Voronoi duals intersect any input polygons, and the domain of the PLC is covered by tetrahedra of good qualities. We found out that the CGAL is the best software for this problem, but can still take hours to output a quality mesh. There is thus a need to explore how to solve this problem on the GPU. The main challenges include how to efficiently perform the point insertion and the intersection test needed to verify whether a facet approximates any input polygons and whether a tetrahedron is inside the domain of the PLC. With the help of the rules, we managed to design an efficient routine called ExpandList which can not only compute the cavities of varying sizes to insert Steiner points but also perform intersection tests for facets and tetrahedra concurrently. Our implementation shows that it can run an order of magnitude faster than the multi-threading CGAL while producing meshes of comparable sizes and qualities with guarantee of termination.