Reference documentation for deal.II version GIT defb42778c 20221205 01:15:02+00:00

This tutorial depends on step16, step40.
Table of contents  

This program was contributed by Katharina Kormann and Martin Kronbichler.
The algorithm for the matrixvector product is based on the article A generic interface for parallel cellbased finite element operator application by Martin Kronbichler and Katharina Kormann, Computers and Fluids 63:135–147, 2012, and the paper "Parallel finite element operator application: Graph partitioning and coloring" by Katharina Kormann and Martin Kronbichler in: Proceedings of the 7th IEEE International Conference on eScience, 2011.
This work was partly supported by the German Research Foundation (DFG) through the project "Highorder discontinuous Galerkin for the exascale" (ExaDG) within the priority program "Software for Exascale Computing" (SPPEXA). The largescale computations shown in the results section of this tutorial program were supported by Gauss Centre for Supercomputing e.V. (www.gausscentre.eu) by providing computing time on the GCS Supercomputer SuperMUC at Leibniz Supercomputing Centre (LRZ, www.lrz.de) through project id pr83te.
This example shows how to implement a matrixfree method, that is, a method that does not explicitly store the matrix elements, for a secondorder Poisson equation with variable coefficients on a hypercube. The linear system will be solved with a multigrid method and uses largescale parallelism with MPI.
The major motivation for matrixfree methods is the fact that on today's processors access to main memory (i.e., for objects that do not fit in the caches) has become the bottleneck in many solvers for partial differential equations: To perform a matrixvector product based on matrices, modern CPUs spend far more time waiting for data to arrive from memory than on actually doing the floating point multiplications and additions. Thus, if we could substitute looking up matrix elements in memory by recomputing them — or rather, the operator represented by these entries —, we may win in terms of overall runtime even if this requires a significant number of additional floating point operations. That said, to realize this with a trivial implementation is not enough and one needs to really look at the details to gain in performance. This tutorial program and the papers referenced above show how one can implement such a scheme and demonstrates the speedup that can be obtained.
In this example, we consider the Poisson problem
\begin{eqnarray*}  \nabla \cdot a(\mathbf x) \nabla u &=& 1, \\ u &=& 0 \quad \text{on}\ \partial \Omega \end{eqnarray*}
where \(a(\mathbf x)\) is a variable coefficient. Below, we explain how to implement a matrixvector product for this problem without explicitly forming the matrix. The construction can, of course, be done in a similar way for other equations as well.
We choose as domain \(\Omega=[0,1]^3\) and \(a(\mathbf x)=\frac{1}{0.05 + 2\\mathbf x\^2}\). Since the coefficient is symmetric around the origin but the domain is not, we will end up with a nonsymmetric solution.
In order to find out how we can write a code that performs a matrixvector product, but does not need to store the matrix elements, let us start at looking how a finite element matrix A is assembled:
\begin{eqnarray*} A = \sum_{\mathrm{cell}=1}^{\mathrm{n\_cells}} P_{\mathrm{cell,{locglob}}}^T A_{\mathrm{cell}} P_{\mathrm{cell,{locglob}}}. \end{eqnarray*}
In this formula, the matrix P_{cell,locglob} is a rectangular matrix that defines the index mapping from local degrees of freedom in the current cell to the global degrees of freedom. The information from which this operator can be built is usually encoded in the local_dof_indices
variable and is used in the assembly calls filling matrices in deal.II. Here, A_{cell} denotes the cell matrix associated with A.
If we are to perform a matrixvector product, we can hence use that
\begin{eqnarray*} y &=& A\cdot u = \left(\sum_{\text{cell}=1}^{\mathrm{n\_cells}} P_\mathrm{cell,{locglob}}^T A_\mathrm{cell} P_\mathrm{cell,{locglob}}\right) \cdot u \\ &=& \sum_{\mathrm{cell}=1}^{\mathrm{n\_cells}} P_\mathrm{cell,{locglob}}^T A_\mathrm{cell} u_\mathrm{cell} \\ &=& \sum_{\mathrm{cell}=1}^{\mathrm{n\_cells}} P_\mathrm{cell,{locglob}}^T v_\mathrm{cell}, \end{eqnarray*}
where u_{cell} are the values of u at the degrees of freedom of the respective cell, and v_{cell}=A_{cell}u_{cell} correspondingly for the result. A naive attempt to implement the local action of the Laplacian would hence be to use the following code:
Here we neglected boundary conditions as well as any hanging nodes we may have, though neither would be very difficult to include using the AffineConstraints class. Note how we first generate the local matrix in the usual way as a sum over all quadrature points for each local matrix entry. To form the actual product as expressed in the above formula, we extract the values of src
of the cellrelated degrees of freedom (the action of P_{cell,locglob}), multiply by the local matrix (the action of A_{cell}), and finally add the result to the destination vector dst
(the action of P_{cell,locglob}^{T}, added over all the elements). It is not more difficult than that, in principle.
While this code is completely correct, it is very slow. For every cell, we generate a local matrix, which takes three nested loops with loop length equal to the number of local degrees of freedom to compute. The multiplication itself is then done by two nested loops, which means that it is much cheaper.
One way to improve this is to realize that conceptually the local matrix can be thought of as the product of three matrices,
\begin{eqnarray*} A_\mathrm{cell} = B_\mathrm{cell}^T D_\mathrm{cell} B_\mathrm{cell}, \end{eqnarray*}
where for the example of the Laplace operator the (q*dim+d,i)th element of B_{cell} is given by fe_values.shape_grad(i,q)[d]
. This matrix consists of dim*n_q_points
rows and dofs_per_cell
columns. The matrix D_{cell} is diagonal and contains the values fe_values.JxW(q) * coefficient_values[q]
(or, rather, dim
copies of each of these values). This kind of representation of finite element matrices can often be found in the engineering literature.
When the cell matrix is applied to a vector,
\begin{eqnarray*} A_\mathrm{cell}\cdot u_\mathrm{cell} = B_\mathrm{cell}^T D_\mathrm{cell} B_\mathrm{cell} \cdot u_\mathrm{cell}, \end{eqnarray*}
one would then not form the matrixmatrix products, but rather multiply one matrix at a time with a vector from right to left so that only three successive matrixvector products are formed. This approach removes the three nested loops in the calculation of the local matrix, which reduces the complexity of the work on one cell from something like \(\mathcal {O}(\mathrm{dofs\_per\_cell}^3)\) to \(\mathcal {O}(\mathrm{dofs\_per\_cell}^2)\). An interpretation of this algorithm is that we first transform the vector of values on the local DoFs to a vector of gradients on the quadrature points. In the second loop, we multiply these gradients by the integration weight and the coefficient. The third loop applies the second gradient (in transposed form), so that we get back to a vector of (Laplacian) values on the cell dofs.
The bottleneck in the above code is the operations done by the call to FEValues::reinit for every cell
, which take about as much time as the other steps together (at least if the mesh is unstructured; deal.II can recognize that the gradients are often unchanged on structured meshes). That is certainly not ideal and we would like to do better than this. What the reinit function does is to calculate the gradient in real space by transforming the gradient on the reference cell using the Jacobian of the transformation from real to reference cell. This is done for each basis function on the cell, for each quadrature point. The Jacobian does not depend on the basis function, but it is different on different quadrature points in general. If you only build the matrix once as we've done in all previous tutorial programs, there is nothing to be optimized since FEValues::reinit needs to be called on every cell. In this process, the transformation is applied while computing the local matrix elements.
In a matrixfree implementation, however, we will compute those integrals very often because iterative solvers will apply the matrix many times during the solution process. Therefore, we need to think about whether we may be able to cache some data that gets reused in the operator applications, i.e., integral computations. On the other hand, we realize that we must not cache too much data since otherwise we get back to the situation where memory access becomes the dominating factor. Therefore, we will not store the transformed gradients in the matrix B, as they would in general be different for each basis function and each quadrature point on every element for curved meshes.
The trick is to factor out the Jacobian transformation and first apply the gradient on the reference cell only. This operation interpolates the vector of values on the local dofs to a vector of (unitcoordinate) gradients on the quadrature points. There, we first apply the Jacobian that we factored out from the gradient, then apply the weights of the quadrature, and finally apply the transposed Jacobian for preparing the third loop which tests by the gradients on the unit cell and sums over quadrature points.
Let us again write this in terms of matrices. Let the matrix B_{cell} denote the cellrelated gradient matrix, with each row containing the values on the quadrature points. It is constructed by a matrixmatrix product as
\begin{eqnarray*} B_\mathrm{cell} = J_\mathrm{cell}^{\mathrm T} B_\mathrm{ref\_cell}, \end{eqnarray*}
where B_{ref_cell} denotes the gradient on the reference cell and J^{T}_{cell} denotes the inverse transpose Jacobian of the transformation from unit to real cell (in the language of transformations, the operation represented by J^{T}_{cell} represents a covariant transformation). J^{T}_{cell} is blockdiagonal, and the blocks size is equal to the dimension of the problem. Each diagonal block is the Jacobian transformation that goes from the reference cell to the real cell.
Putting things together, we find that
\begin{eqnarray*} A_\mathrm{cell} = B_\mathrm{cell}^T D B_\mathrm{cell} = B_\mathrm{ref\_cell}^T J_\mathrm{cell}^{1} D_\mathrm{cell} J_\mathrm{cell}^{\mathrm T} B_\mathrm{ref\_cell}, \end{eqnarray*}
so we calculate the product (starting the local product from the right)
\begin{eqnarray*} v_\mathrm{cell} = B_\mathrm{ref\_cell}^T J_\mathrm{cell}^{1} D J_\mathrm{cell}^{\mathrm T} B_\mathrm{ref\_cell} u_\mathrm{cell}, \quad v = \sum_{\mathrm{cell}=1}^{\mathrm{n\_cells}} P_\mathrm{cell,{locglob}}^T v_\mathrm{cell}. \end{eqnarray*}
Note how we create an additional FEValues object for the reference cell gradients and how we initialize it to the reference cell. The actual derivative data is then applied by the inverse, transposed Jacobians (deal.II calls the Jacobian matrix from real to unit cell inverse_jacobian, as the forward transformation is from unit to real cell). The factor \(J_\mathrm{cell}^{1} D_\mathrm{cell} J_\mathrm{cell}^{\mathrm T}\) is blockdiagonal over quadrature. In this form, one realizes that variable coefficients (possibly expressed through a tensor) and general grid topologies with Jacobian transformations have a similar effect on the coefficient transforming the unitcell derivatives.
At this point, one might wonder why we store the matrix \(J_\mathrm{cell}^{\mathrm T}\) and the coefficient separately, rather than only the complete factor \(J_\mathrm{cell}^{1} D_\mathrm{cell} J_\mathrm{cell}^{\mathrm T}\). The latter would use less memory because the tensor is symmetric with six independent values in 3D, whereas for the former we would need nine entries for the inverse transposed Jacobian, one for the quadrature weight and Jacobian determinant, and one for the coefficient, totaling to 11 doubles. The reason is that the former approach allows for implementing generic differential operators through a common framework of cached data, whereas the latter specifically stores the coefficient for the Laplacian. In case applications demand for it, this specialization could pay off and would be worthwhile to consider. Note that the implementation in deal.II is smart enough to detect Cartesian or affine geometries where the Jacobian is constant throughout the cell and needs not be stored for every cell (and indeed often is the same over different cells as well).
The final optimization that is most crucial from an operation count point of view is to make use of the tensor product structure in the basis functions. This is possible because we have factored out the gradient from the reference cell operation described by B_{ref_cell}, i.e., an interpolation operation over the completely regular data fields of the reference cell. We illustrate the process of complexity reduction in two space dimensions, but the same technique can be used in higher dimensions. On the reference cell, the basis functions are of the tensor product form \(\phi(x,y) = \varphi_i(x) \varphi_j(y)\). The part of the matrix B_{ref_cell} that computes the first component has the form \(B_\mathrm{sub\_cell}^x = B_\mathrm{grad,x} \otimes B_\mathrm{val,y}\), where B_{grad,x} and B_{val,y} contain the evaluation of all the 1D basis functions on all the 1D quadrature points. Forming a matrix U with U(j,i) containing the coefficient belonging to basis function \(\varphi_i(x) \varphi_j(y)\), we get \((B_\mathrm{grad,x} \otimes B_\mathrm{val,y})u_\mathrm{cell} = B_\mathrm{val,y} U B_\mathrm{grad,x}\). This reduces the complexity for computing this product from \(p^4\) to \(2 p^3\), where p1 is the degree of the finite element (i.e., equivalently, p is the number of shape functions in each coordinate direction), or \(p^{2d}\) to \(d p^{d+1}\) in general. The reason why we look at the complexity in terms of the polynomial degree is since we want to be able to go to high degrees and possibly increase the polynomial degree p instead of the grid resolution. Good algorithms for moderate degrees like the ones used here are linear in the polynomial degree independent on the dimension, as opposed to matrixbased schemes or naive evaluation through FEValues. The techniques used in the implementations of deal.II have been established in the spectral element community since the 1980s.
Implementing a matrixfree and cellbased finite element operator requires a somewhat different program design as compared to the usual matrix assembly codes shown in previous tutorial programs. The data structures for doing this are the MatrixFree class that collects all data and issues a (parallel) loop over all cells and the FEEvaluation class that evaluates finite element basis functions by making use of the tensor product structure.
The implementation of the matrixfree matrixvector product shown in this tutorial is slower than a matrixvector product using a sparse matrix for linear elements, but faster for all higher order elements thanks to the reduced complexity due to the tensor product structure and due to less memory transfer during computations. The impact of reduced memory transfer is particularly beneficial when working on a multicore processor where several processing units share access to memory. In that case, an algorithm which is computation bound will show almost perfect parallel speedup (apart from possible changes of the processor's clock frequency through turbo modes depending on how many cores are active), whereas an algorithm that is bound by memory transfer might not achieve similar speedup (even when the work is perfectly parallel and one could expect perfect scaling like in sparse matrixvector products). An additional gain with this implementation is that we do not have to build the sparse matrix itself, which can also be quite expensive depending on the underlying differential equation. Moreover, the above framework is simple to generalize to nonlinear operations, as we demonstrate in step48.
Above, we have gone to significant lengths to implement a matrixvector product that does not actually store the matrix elements. In many user codes, however, one wants more than just doing a few matrixvector products — one wants to do as few of these operations as possible when solving linear systems. In theory, we could use the CG method without preconditioning; however, that would not be very efficient for the Laplacian. Rather, preconditioners are used for increasing the speed of convergence. Unfortunately, most of the more frequently used preconditioners such as SSOR, ILU or algebraic multigrid (AMG) cannot be used here because their implementation requires knowledge of the elements of the system matrix.
One solution is to use geometric multigrid methods as shown in step16. They are known to be very fast, and they are suitable for our purpose since all ingredients, including the transfer between different grid levels, can be expressed in terms of matrixvector products related to a collection of cells. All one needs to do is to find a smoother that is based on matrixvector products rather than all the matrix entries. One such candidate would be a damped Jacobi iteration that requires access to the matrix diagonal, but it is often not sufficiently good in damping all highfrequency errors. The properties of the Jacobi method can be improved by iterating it a few times with the socalled Chebyshev iteration. The Chebyshev iteration is described by a polynomial expression of the matrixvector product where the coefficients can be chosen to achieve certain properties, in this case to smooth the highfrequency components of the error which are associated to the eigenvalues of the Jacobipreconditioned matrix. At degree zero, the Jacobi method with optimal damping parameter is retrieved, whereas higher order corrections are used to improve the smoothing properties. The effectiveness of Chebyshev smoothing in multigrid has been demonstrated, e.g., in the article M. Adams, M. Brezina, J. Hu, R. Tuminaro. Parallel multigrid smoothers: polynomial versus Gauss–Seidel, J. Comput. Phys. 188:593–610, 2003. This publication also identifies one more advantage of Chebyshev smoothers that we exploit here, namely that they are easy to parallelize, whereas SOR/Gauss–Seidel smoothing relies on substitutions, for which a naive parallelization works on diagonal subblocks of the matrix, thereby decreases efficiency (for more detail see e.g. Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, 2nd edition, 2003, chapters 11 & 12).
The implementation into the multigrid framework is then straightforward. The multigrid implementation in this program is similar to step16 and includes adaptivity.
The computational kernels for evaluation in FEEvaluation are written in a way to optimally use computational resources. To achieve this, they do not operate on double data types, but something we call VectorizedArray (check e.g. the return type of FEEvaluationBase::get_value, which is VectorizedArray for a scalar element and a Tensor of VectorizedArray for a vector finite element). VectorizedArray is a short array of doubles or float whose length depends on the particular computer system in use. For example, systems based on x8664 support the streaming SIMD extensions (SSE), where the processor's vector units can process two doubles (or four singleprecision floats) by one CPU instruction. Newer processors (from about 2012 and onwards) support the socalled advanced vector extensions (AVX) with 256 bit operands, which can use four doubles and eight floats, respectively. Vectorization is a singleinstruction/multipledata (SIMD) concept, that is, one CPU instruction is used to process multiple data values at once. Often, finite element programs do not use vectorization explicitly as the benefits of this concept are only in arithmetic intensive operations. The bulk of typical finite element workloads are memory bandwidth limited (operations on sparse matrices and vectors) where the additional computational power is useless.
Behind the scenes, optimized BLAS packages might heavily rely on vectorization, though. Also, optimizing compilers might automatically transform loops involving standard code into more efficient vectorized form (deal.II uses OpenMP SIMD pragmas inside the regular loops of vector updates). However, the data flow must be very regular in order for compilers to produce efficient code. For example, already the automatic vectorization of the prototype operation that benefits from vectorization, matrixmatrix products, fails on most compilers (as of writing this tutorial in early 2012 and updating in late 2016, neither gcc nor the Intel compiler manage to produce useful vectorized code for the FullMatrix::mmult function, and not even on the simpler case where the matrix bounds are compiletime constants instead of runtime constants as in FullMatrix::mmult). The main reason for this is that the information to be processed at the innermost loop (that is where vectorization is applied) is not necessarily a multiple of the vector length, leaving parts of the resources unused. Moreover, the data that can potentially be processed together might not be laid out in a contiguous way in memory or not with the necessary alignment to address boundaries that are needed by the processor. Or the compiler might not be able to prove that data arrays do not overlap when loading several elements at once.
In the matrixfree implementation in deal.II, we have therefore chosen to apply vectorization at the level which is most appropriate for finite element computations: The cellwise computations are typically exactly the same for all cells (except for indices in the indirect addressing used while reading from and writing to vectors), and hence SIMD can be used to process several cells at once. In all what follows, you can think of a VectorizedArray to hold data from several cells. Remember that it is not related to the spatial dimension and the number of elements e.g. in a Tensor or Point.
Note that vectorization depends on the CPU the code is running on and for which the code is compiled. In order to generate the fastest kernels of FEEvaluation for your computer, you should compile deal.II with the socalled native processor variant. When using the gcc compiler, it can be enabled by setting the variable CMAKE_CXX_FLAGS
to "march=native"
in the cmake build settings (on the command line, specify DCMAKE_CXX_FLAGS="march=native"
, see the deal.II README for more information). Similar options exist for other compilers. We output the current vectorization length in the run() function of this example.
As mentioned above, all components in the matrixfree framework can easily be parallelized with MPI using domain decomposition. Thanks to the easy access to largescale parallel meshes through p4est (see step40 for details) in deal.II, and the fact that cellbased loops with matrixfree evaluation only need a decomposition of the mesh into chunks of roughly equal size on each processor, there is relatively little to do to write a parallel program working with distributed memory. While other tutorial programs using MPI have relied on either PETSc or Trilinos, this program uses deal.II's own parallel vector facilities.
The deal.II parallel vector class, LinearAlgebra::distributed::Vector, holds the processorlocal part of the solution as well as data fields for ghosted DoFs, i.e. DoFs that are owned by a remote processor but accessed by cells that are owned by the present processor. In the glossary these degrees of freedom are referred to as locally active degrees of freedom. The function MatrixFree::initialize_dof_vector() provides a method that sets this design. Note that hanging nodes can relate to additional ghosted degrees of freedom that must be included in the distributed vector but are not part of the locally active DoFs in the sense of the glossary. Moreover, the distributed vector holds the MPI metadata for DoFs that are owned locally but needed by other processors. A benefit of the design of this vector class is the way ghosted entries are accessed. In the storage scheme of the vector, the data array extends beyond the processorlocal part of the solution with further vector entries available for the ghosted degrees of freedom. This gives a contiguous index range for all locally active degrees of freedom. (Note that the index range depends on the exact configuration of the mesh.) Since matrixfree operations can be thought of doing linear algebra that is performance critical, and performancecritical code cannot waste time on doing MPIglobal to MPIlocal index translations, the availability of an index spaces local to one MPI rank is fundamental. The way things are accessed here is a direct array access. This is provided through LinearAlgebra::distributed::Vector::local_element(), but it is actually rarely needed because all of this happens internally in FEEvaluation.
The design of LinearAlgebra::distributed::Vector is similar to the PETScWrappers::MPI::Vector and TrilinosWrappers::MPI::Vector data types we have used in step40 and step32 before, but since we do not need any other parallel functionality of these libraries, we use the LinearAlgebra::distributed::Vector class of deal.II instead of linking in another large library in this tutorial program. Also note that the PETSc and Trilinos vectors do not provide the finegrained control over ghost entries with direct array access because they abstract away the necessary implementation details.
First include the necessary files from the deal.II library.
This includes the data structures for the efficient implementation of matrixfree methods or more generic finite element operators with the class MatrixFree.
To be efficient, the operations performed in the matrixfree implementation require knowledge of loop lengths at compile time, which are given by the degree of the finite element. Hence, we collect the values of the two template parameters that can be changed at one place in the code. Of course, one could make the degree of the finite element a runtime parameter by compiling the computational kernels for all degrees that are likely (say, between 1 and 6) and selecting the appropriate kernel at run time. Here, we simply choose second order \(Q_2\) elements and choose dimension 3 as standard.
We define a variable coefficient function for the Poisson problem. It is similar to the function in step5 but we use the form \(a(\mathbf x)=\frac{1}{0.05 + 2\\bf x\^2}\) instead of a discontinuous one. It is merely to demonstrate the possibilities of this implementation, rather than making much sense physically. We define the coefficient in the same way as functions in earlier tutorial programs. There is one new function, namely a value
method with template argument number
.
This is the new function mentioned above: Evaluate the coefficient for abstract type number
. It might be just a usual double, but it can also be a somewhat more complicated type that we call VectorizedArray. This data type is essentially a short array of doubles as discussed in the introduction that holds data from several cells. For example, we evaluate the coefficient shown here not on a simple point as usually done, but we hand it a Point<dim,VectorizedArray<double> > point, which is actually a collection of four points in the case of AVX. Do not confuse the entries in VectorizedArray with the different coordinates of the point. Indeed, the data is laid out such that p[0]
returns a VectorizedArray, which in turn contains the xcoordinate for the first point and the second point. You may access the coordinates individually using e.g. p[0][j]
, j=0,1,2,3, but it is recommended to define operations on a VectorizedArray as much as possible in order to make use of vectorized operations.
In the function implementation, we assume that the number type overloads basic arithmetic operations, so we just write the code as usual. The base class function value
is then computed from the templated function with double type, in order to avoid duplicating code.
The following class, called LaplaceOperator
, implements the differential operator. For all practical purposes, it is a matrix, i.e., you can ask it for its size (member functions m(), n()
) and you can apply it to a vector (the vmult()
function). The difference to a real matrix of course lies in the fact that this class does not actually store the elements of the matrix, but only knows how to compute the action of the operator when applied to a vector.
The infrastructure describing the matrix size, the initialization from a MatrixFree object, and the various interfaces to matrixvector products through vmult() and Tvmult() methods, is provided by the class MatrixFreeOperator::Base from which this class derives. The LaplaceOperator class defined here only has to provide a few interfaces, namely the actual action of the operator through the apply_add() method that gets used in the vmult() functions, and a method to compute the diagonal entries of the underlying matrix. We need the diagonal for the definition of the multigrid smoother. Since we consider a problem with variable coefficient, we further implement a method that can fill the coefficient values.
Note that the file include/deal.II/matrix_free/operators.h
already contains an implementation of the Laplacian through the class MatrixFreeOperators::LaplaceOperator. For educational purposes, the operator is reimplemented in this tutorial program, explaining the ingredients and concepts used there.
This program makes use of the data cache for finite element operator application that is integrated in deal.II. This data cache class is called MatrixFree. It contains mapping information (Jacobians) and index relations between local and global degrees of freedom. It also contains constraints like the ones from hanging nodes or Dirichlet boundary conditions. Moreover, it can issue a loop over all cells in parallel, making sure that only cells are worked on that do not share any degree of freedom (this makes the loop threadsafe when writing into destination vectors). This is a more advanced strategy compared to the WorkStream class described in the Parallel computing with multiple processors accessing shared memory module. Of course, to not destroy threadsafety, we have to be careful when writing into classglobal structures.
The class implementing the Laplace operator has three template arguments, one for the dimension (as many deal.II classes carry), one for the degree of the finite element (which we need to enable efficient computations through the FEEvaluation class), and one for the underlying scalar type. We want to use double
numbers (i.e., double precision, 64bit floating point) for the final matrix, but floats (single precision, 32bit floating point numbers) for the multigrid level matrices (as that is only a preconditioner, and floats can be processed twice as fast). The class FEEvaluation also takes a template argument for the number of quadrature points in one dimension. In the code below, we hardcode it to fe_degree+1
. If we wanted to change it independently of the polynomial degree, we would need to add a template parameter as is done in the MatrixFreeOperators::LaplaceOperator class.
As a sidenote, if we implemented several different operations on the same grid and degrees of freedom (like a mass matrix and a Laplace matrix), we would define two classes like the current one for each of the operators (derived from the MatrixFreeOperators::Base class), and let both of them refer to the same MatrixFree data cache from the general problem class. The interface through MatrixFreeOperators::Base requires us to only provide a minimal set of functions. This concept allows for writing complex application codes with many matrixfree operations.
VectorizedArray<number>
requires care: Here, we use the deal.II table class which is prepared to hold the data with correct alignment. However, storing e.g. an std::vector<VectorizedArray<number> >
is not possible with vectorization: A certain alignment of the data with the memory address boundaries is required (essentially, a VectorizedArray that is 32 bytes long in case of AVX needs to start at a memory address that is divisible by 32). The table class (as well as the AlignedVector class it is based on) makes sure that this alignment is respected, whereas std::vector does not in general, which may lead to segmentation faults at strange places for some systems or suboptimal performance for other systems.This is the constructor of the LaplaceOperator
class. All it does is to call the default constructor of the base class MatrixFreeOperators::Base, which in turn is based on the Subscriptor class that asserts that this class is not accessed after going out of scope e.g. in a preconditioner.
To initialize the coefficient, we directly give it the Coefficient class defined above and then select the method coefficient_function.value
with vectorized number (which the compiler can deduce from the point data type). The use of the FEEvaluation class (and its template arguments) will be explained below.
Here comes the main function of this class, the evaluation of the matrixvector product (or, in general, a finite element operator evaluation). This is done in a function that takes exactly four arguments, the MatrixFree object, the destination and source vectors, and a range of cells that are to be worked on. The method cell_loop
in the MatrixFree class will internally call this function with some range of cells that is obtained by checking which cells are possible to work on simultaneously so that write operations do not cause any race condition. Note that the cell range used in the loop is not directly the number of (active) cells in the current mesh, but rather a collection of batches of cells. In other word, "cell" may be the wrong term to begin with, since FEEvaluation groups data from several cells together. This means that in the loop over quadrature points we are actually seeing a group of quadrature points of several cells as one block. This is done to enable a higher degree of vectorization. The number of such "cells" or "cell batches" is stored in MatrixFree and can be queried through MatrixFree::n_cell_batches(). Compared to the deal.II cell iterators, in this class all cells are laid out in a plain array with no direct knowledge of level or neighborship relations, which makes it possible to index the cells by unsigned integers.
The implementation of the Laplace operator is quite simple: First, we need to create an object FEEvaluation that contains the computational kernels and has data fields to store temporary results (e.g. gradients evaluated on all quadrature points on a collection of a few cells). Note that temporary results do not use a lot of memory, and since we specify template arguments with the element order, the data is stored on the stack (without expensive memory allocation). Usually, one only needs to set two template arguments, the dimension as a first argument and the degree of the finite element as the second argument (this is equal to the number of degrees of freedom per dimension minus one for FE_Q elements). However, here we also want to be able to use float numbers for the multigrid preconditioner, which is the last (fifth) template argument. Therefore, we cannot rely on the default template arguments and must also fill the third and fourth field, consequently. The third argument specifies the number of quadrature points per direction and has a default value equal to the degree of the element plus one. The fourth argument sets the number of components (one can also evaluate vectorvalued functions in systems of PDEs, but the default is a scalar element), and finally the last argument sets the number type.
Next, we loop over the given cell range and then we continue with the actual implementation:
read_dof_values
), including the resolution of constraints. This stores \(u_\mathrm{cell}\) as described in the introduction. get_gradient
that applies the Jacobian and returns the gradient in real space. Then, we just need to multiply by the (scalar) coefficient, and let the function submit_gradient
apply the second Jacobian (for the test function) and the quadrature weight and Jacobian determinant (JxW). Note that the submitted gradient is stored in the same data field as where it is read from in get_gradient
. Therefore, you need to make sure to not read from the same quadrature point again after having called submit_gradient
on that particular quadrature point. In general, it is a good idea to copy the result of get_gradient
when it is used more often than once. distribute_local_to_global
, the same name as the corresponding function in the AffineConstraints (only that we now store the local vector in the FEEvaluation object, as are the indices between local and global degrees of freedom). This function implements the loop over all cells for the Base::apply_add() interface. This is done with the cell_loop
of the MatrixFree class, which takes the operator() of this class with arguments MatrixFree, OutVector, InVector, cell_range. When working with MPI parallelization (but no threading) as is done in this tutorial program, the cell loop corresponds to the following three lines of code:
Here, the two calls update_ghost_values() and compress() perform the data exchange on processor boundaries for MPI, once for the source vector where we need to read from entries owned by remote processors, and once for the destination vector where we have accumulated parts of the residuals that need to be added to the respective entry of the owner processor. However, MatrixFree::cell_loop does not only abstract away those two calls, but also performs some additional optimizations. On the one hand, it will split the update_ghost_values() and compress() calls in a way to allow for overlapping communication and computation. The local_apply function is then called with three cell ranges representing partitions of the cell range from 0 to MatrixFree::n_cell_batches(). On the other hand, cell_loop also supports thread parallelism in which case the cell ranges are split into smaller chunks and scheduled in an advanced way that avoids access to the same vector entry by several threads. That feature is explained in step48.
Note that after the cell loop, the constrained degrees of freedom need to be touched once more for sensible vmult() operators: Since the assembly loop automatically resolves constraints (just as the AffineConstraints::distribute_local_to_global() call does), it does not compute any contribution for constrained degrees of freedom, leaving the respective entries zero. This would represent a matrix that had empty rows and columns for constrained degrees of freedom. However, iterative solvers like CG only work for nonsingular matrices. The easiest way to do that is to set the subblock of the matrix that corresponds to constrained DoFs to an identity matrix, in which case application of the matrix would simply copy the elements of the right hand side vector into the left hand side. Fortunately, the vmult() implementations MatrixFreeOperators::Base do this automatically for us outside the apply_add() function, so we do not need to take further action here.
When using the combination of MatrixFree and FEEvaluation in parallel with MPI, there is one aspect to be careful about — the indexing used for accessing the vector. For performance reasons, MatrixFree and FEEvaluation are designed to access vectors in MPIlocal index space also when working with multiple processors. Working in local index space means that no index translation needs to be performed at the place the vector access happens, apart from the unavoidable indirect addressing. However, local index spaces are ambiguous: While it is standard convention to access the locally owned range of a vector with indices between 0 and the local size, the numbering is not so clear for the ghosted entries and somewhat arbitrary. For the matrixvector product, only the indices appearing on locally owned cells (plus those referenced via hanging node constraints) are necessary. However, in deal.II we often set all the degrees of freedom on ghosted elements as ghosted vector entries, called the locally relevant DoFs described in the glossary. In that case, the MPIlocal index of a ghosted vector entry can in general be different in the two possible ghost sets, despite referring to the same global index. To avoid problems, FEEvaluation checks that the partitioning of the vector used for the matrixvector product does indeed match with the partitioning of the indices in MatrixFree by a check called LinearAlgebra::distributed::Vector::partitioners_are_compatible. To facilitate things, the MatrixFreeOperators::Base class includes a mechanism to fit the ghost set to the correct layout. This happens in the ghost region of the vector, so keep in mind that the ghost region might be modified in both the destination and source vector after a call to a vmult() method. This is legitimate because the ghost region of a distributed deal.II vector is a mutable section and filled on demand. Vectors used in matrixvector products must not be ghosted upon entry of vmult() functions, so no information gets lost.
The following function implements the computation of the diagonal of the operator. Computing matrix entries of a matrixfree operator evaluation turns out to be more complicated than evaluating the operator. Fundamentally, we could obtain a matrix representation of the operator by applying the operator on all unit vectors. Of course, that would be very inefficient since we would need to perform n operator evaluations to retrieve the whole matrix. Furthermore, this approach would completely ignore the matrix sparsity. On an individual cell, however, this is the way to go and actually not that inefficient as there usually is a coupling between all degrees of freedom inside the cell.
We first initialize the diagonal vector to the correct parallel layout. This vector is encapsulated in a member called inverse_diagonal_entries of type DiagonalMatrix in the base class MatrixFreeOperators::Base. This member is a shared pointer that we first need to initialize and then get the vector representing the diagonal entries in the matrix. As to the actual diagonal computation, we again use the cell_loop infrastructure of MatrixFree to invoke a local worker routine called local_compute_diagonal(). Since we will only write into a vector but not have any source vector, we put a dummy argument of type unsigned int
in place of the source vector to confirm with the cell_loop interface. After the loop, we need to set the vector entries subject to Dirichlet boundary conditions to one (either those on the boundary described by the AffineConstraints object inside MatrixFree or the indices at the interface between different grid levels in adaptive multigrid). This is done through the function MatrixFreeOperators::Base::set_constrained_entries_to_one() and matches with the setting in the matrixvector product provided by the Base operator. Finally, we need to invert the diagonal entries which is the form required by the Chebyshev smoother based on the Jacobi iteration. In the loop, we assert that all entries are nonzero, because they should either have obtained a positive contribution from integrals or be constrained and treated by set_constrained_entries_to_one()
following cell_loop.
In the local compute loop, we compute the diagonal by a loop over all columns in the local matrix and putting the entry 1 in the ith slot and a zero entry in all other slots, i.e., we apply the cellwise differential operator on one unit vector at a time. The inner part invoking FEEvaluation::evaluate, the loop over quadrature points, and FEEvalution::integrate, is exactly the same as in the local_apply function. Afterwards, we pick out the ith entry of the local result and put it to a temporary storage (as we overwrite all entries in the array behind FEEvaluation::get_dof_value() with the next loop iteration). Finally, the temporary storage is written to the destination vector. Note how we use FEEvaluation::get_dof_value() and FEEvaluation::submit_dof_value() to read and write to the data field that FEEvaluation uses for the integration on the one hand and writes into the global vector on the other hand.
Given that we are only interested in the matrix diagonal, we simply throw away all other entries of the local matrix that have been computed along the way. While it might seem wasteful to compute the complete cell matrix and then throw away everything but the diagonal, the integration are so efficient that the computation does not take too much time. Note that the complexity of operator evaluation per element is \(\mathcal O((p+1)^{d+1})\) for polynomial degree \(k\), so computing the whole matrix costs us \(\mathcal O((p+1)^{2d+1})\) operations, not too far away from \(\mathcal O((p+1)^{2d})\) complexity for computing the diagonal with FEValues. Since FEEvaluation is also considerably faster due to vectorization and other optimizations, the diagonal computation with this function is actually the fastest (simple) variant. (It would be possible to compute the diagonal with sum factorization techniques in \(\mathcal O((p+1)^{d+1})\) operations involving specifically adapted kernels—but since such kernels are only useful in that particular context and the diagonal computation is typically not on the critical path, they have not been implemented in deal.II.)
Note that the code that calls distribute_local_to_global on the vector to accumulate the diagonal entries into the global matrix has some limitations. For operators with hanging node constraints that distribute an integral contribution of a constrained DoF to several other entries inside the distribute_local_to_global call, the vector interface used here does not exactly compute the diagonal entries, but lumps some contributions located on the diagonal of the local matrix that would end up in a offdiagonal position of the global matrix to the diagonal. The result is correct up to discretization accuracy as explained in Kormann (2016), section 5.3, but not mathematically equal. In this tutorial program, no harm can happen because the diagonal is only used for the multigrid level matrices where no hanging node constraints appear.
This class is based on the one in step16. However, we replaced the SparseMatrix<double> class by our matrixfree implementation, which means that we can also skip the sparsity patterns. Notice that we define the LaplaceOperator class with the degree of finite element as template argument (the value is defined at the top of the file), and that we use float numbers for the multigrid level matrices.
The class also has a member variable to keep track of all the detailed timings for setting up the entire chain of data before we actually go about solving the problem. In addition, there is an output stream (that is disabled by default) that can be used to output details for the individual setup operations instead of the summary only that is printed out by default.
Since this program is designed to be used with MPI, we also provide the usual pcout
output stream that only prints the information of the processor with MPI rank 0. The grid used for this programs can either be a distributed triangulation based on p4est (in case deal.II is configured to use p4est), otherwise it is a serial grid that only runs without MPI.
When we initialize the finite element, we of course have to use the degree specified at the top of the file as well (otherwise, an exception will be thrown at some point, since the computational kernel defined in the templated LaplaceOperator class and the information from the finite element read out by MatrixFree will not match). The constructor of the triangulation needs to set an additional flag that tells the grid to conform to the 2:1 cell balance over vertices, which is needed for the convergence of the geometric multigrid routines. For the distributed grid, we also need to specifically enable the multigrid hierarchy.
The LaplaceProblem class holds an additional output stream that collects detailed timings about the setup phase. This stream, called time_details, is disabled by default through the false
argument specified here. For detailed timings, removing the false
argument prints all the details.
The setup stage is in analogy to step16 with relevant changes due to the LaplaceOperator class. The first thing to do is to set up the DoFHandler, including the degrees of freedom for the multigrid levels, and to initialize constraints from hanging nodes and homogeneous Dirichlet conditions. Since we intend to use this programs in parallel with MPI, we need to make sure that the constraints get to know the locally relevant degrees of freedom, otherwise the storage would explode when using more than a few hundred millions of degrees of freedom, see step40.
Once we have created the multigrid dof_handler and the constraints, we can call the reinit function for the global matrix operator as well as each level of the multigrid scheme. The main action is to set up the MatrixFree
instance for the problem. The base class of the LaplaceOperator
class, MatrixFreeOperators::Base, is initialized with a shared pointer to MatrixFree object. This way, we can simply create it here and then pass it on to the system matrix and level matrices, respectively. For setting up MatrixFree, we need to activate the update flag in the AdditionalData field of MatrixFree that enables the storage of quadrature point coordinates in real space (by default, it only caches data for gradients (inverse transposed Jacobians) and JxW values). Note that if we call the reinit function without specifying the level (i.e., giving level = numbers::invalid_unsigned_int
), MatrixFree constructs a loop over the active cells. In this tutorial, we do not use threads in addition to MPI, which is why we explicitly disable it by setting the MatrixFree::AdditionalData::tasks_parallel_scheme to MatrixFree::AdditionalData::none. Finally, the coefficient is evaluated and vectors are initialized as explained above.
Next, initialize the matrices for the multigrid method on all the levels. The data structure MGConstrainedDoFs keeps information about the indices subject to boundary conditions as well as the indices on edges between different refinement levels as described in the step16 tutorial program. We then go through the levels of the mesh and construct the constraints and matrices on each level. These follow closely the construction of the system matrix on the original mesh, except the slight difference in naming when accessing information on the levels rather than the active cells.
The assemble function is very simple since all we have to do is to assemble the right hand side. Thanks to FEEvaluation and all the data cached in the MatrixFree class, which we query from MatrixFreeOperators::Base, this can be done in a few lines. Since this call is not wrapped into a MatrixFree::cell_loop (which would be an alternative), we must not forget to call compress() at the end of the assembly to send all the contributions of the right hand side to the owner of the respective degree of freedom.
The solution process is similar as in step16. We start with the setup of the transfer. For LinearAlgebra::distributed::Vector, there is a very fast transfer class called MGTransferMatrixFree that does the interpolation between the grid levels with the same fast sum factorization kernels that get also used in FEEvaluation.
As a smoother, this tutorial program uses a Chebyshev iteration instead of SOR in step16. (SOR would be very difficult to implement because we do not have the matrix elements available explicitly, and it is difficult to make it work efficiently in parallel.) The smoother is initialized with our level matrices and the mandatory additional data for the Chebyshev smoother. We use a relatively high degree here (5), since matrixvector products are comparably cheap. We choose to smooth out a range of \([1.2 \hat{\lambda}_{\max}/15,1.2 \hat{\lambda}_{\max}]\) in the smoother where \(\hat{\lambda}_{\max}\) is an estimate of the largest eigenvalue (the factor 1.2 is applied inside PreconditionChebyshev). In order to compute that eigenvalue, the Chebyshev initialization performs a few steps of a CG algorithm without preconditioner. Since the highest eigenvalue is usually the easiest one to find and a rough estimate is enough, we choose 10 iterations. Finally, we also set the inner preconditioner type in the Chebyshev method which is a Jacobi iteration. This is represented by the DiagonalMatrix class that gets the inverse diagonal entry provided by our LaplaceOperator class.
On level zero, we initialize the smoother differently because we want to use the Chebyshev iteration as a solver. PreconditionChebyshev allows the user to switch to solver mode where the number of iterations is internally chosen to the correct value. In the additional data object, this setting is activated by choosing the polynomial degree to numbers::invalid_unsigned_int
. The algorithm will then attack all eigenvalues between the smallest and largest one in the coarse level matrix. The number of steps in the Chebyshev smoother are chosen such that the Chebyshev convergence estimates guarantee to reduce the residual by the number specified in the variable smoothing_range
. Note that for solving, smoothing_range
is a relative tolerance and chosen smaller than one, in this case, we select three orders of magnitude, whereas it is a number larger than 1 when only selected eigenvalues are smoothed.
From a computational point of view, the Chebyshev iteration is a very attractive coarse grid solver as long as the coarse size is moderate. This is because the Chebyshev method performs only matrixvector products and vector updates, which typically parallelize better to the largest cluster size with more than a few tens of thousands of cores than inner product involved in other iterative methods. The former involves only local communication between neighbors in the (coarse) mesh, whereas the latter requires global communication over all processors.
The next step is to set up the interface matrices that are needed for the case with hanging nodes. The adaptive multigrid realization in deal.II implements an approach called local smoothing. This means that the smoothing on the finest level only covers the local part of the mesh defined by the fixed (finest) grid level and ignores parts of the computational domain where the terminal cells are coarser than this level. As the method progresses to coarser levels, more and more of the global mesh will be covered. At some coarser level, the whole mesh will be covered. Since all level matrices in the multigrid method cover a single level in the mesh, no hanging nodes appear on the level matrices. At the interface between multigrid levels, homogeneous Dirichlet boundary conditions are set while smoothing. When the residual is transferred to the next coarser level, however, the coupling over the multigrid interface needs to be taken into account. This is done by the socalled interface (or edge) matrices that compute the part of the residual that is missed by the level matrix with homogeneous Dirichlet conditions. We refer to the Multigrid paper by Janssen and Kanschat for more details.
For the implementation of those interface matrices, there is already a predefined class MatrixFreeOperators::MGInterfaceOperator that wraps the routines MatrixFreeOperators::Base::vmult_interface_down() and MatrixFreeOperators::Base::vmult_interface_up() in a new class with vmult()
and Tvmult()
operations (that were originally written for matrices, hence expecting those names). Note that vmult_interface_down is used during the restriction phase of the multigrid Vcycle, whereas vmult_interface_up is used during the prolongation phase.
Once the interface matrix is created, we set up the remaining Multigrid preconditioner infrastructure in complete analogy to step16 to obtain a preconditioner
object that can be applied to a matrix.
The setup of the multigrid routines is quite easy and one cannot see any difference in the solve process as compared to step16. All the magic is hidden behind the implementation of the LaplaceOperator::vmult operation. Note that we print out the solve time and the accumulated setup time through standard out, i.e., in any case, whereas detailed times for the setup operations are only printed in case the flag for detail_times in the constructor is changed.
Here is the data output, which is a simplified version of step5. We use the standard VTU (= compressed VTK) output for each grid produced in the refinement process. In addition, we use a compression algorithm that is optimized for speed rather than disk usage. The default setting (which optimizes for disk usage) makes saving the output take about 4 times as long as running the linear solver, while setting DataOutBase::CompressionLevel to best_speed lowers this to only one fourth the time of the linear solve.
We disable the output when the mesh gets too large. A variant of this program has been run on hundreds of thousands MPI ranks with as many as 100 billion grid cells, which is not directly accessible to classical visualization tools.
The function that runs the program is very similar to the one in step16. We do few refinement steps in 3D compared to 2D, but that's it.
Before we run the program, we output some information about the detected vectorization level as discussed in the introduction.
main
functionApart from the fact that we set up the MPI framework according to step40, there are no surprises in the main function.
Since this example solves the same problem as step5 (except for a different coefficient), there is little to say about the solution. We show a picture anyway, illustrating the size of the solution through both isocontours and volume rendering:
Of more interest is to evaluate some aspects of the multigrid solver. When we run this program in 2D for quadratic ( \(Q_2\)) elements, we get the following output (when run on one core in release mode):
As in step16, we see that the number of CG iterations remains constant with increasing number of degrees of freedom. A constant number of iterations (together with optimal computational properties) means that the computing time approximately quadruples as the problem size quadruples from one cycle to the next. The code is also very efficient in terms of storage. Around 24 million degrees of freedom fit into 1 GB of memory, see also the MPI results below. An interesting fact is that solving one linear system is cheaper than the setup, despite not building a matrix (approximately half of which is spent in the DoFHandler::distribute_dofs() and DoFHandler::distribute_mg_dofs() calls). This shows the high efficiency of this approach, but also that the deal.II data structures are quite expensive to set up and the setup cost must be amortized over several system solves.
Not much changes if we run the program in three spatial dimensions. Since we use uniform mesh refinement, we get eight times as many elements and approximately eight times as many degrees of freedom with each cycle:
Since it is so easy, we look at what happens if we increase the polynomial degree. When selecting the degree as four in 3D, i.e., on \(\mathcal Q_4\) elements, by changing the line const unsigned int degree_finite_element=4;
at the top of the program, we get the following program output:
Since \(\mathcal Q_4\) elements on a certain mesh correspond to \(\mathcal Q_2\) elements on half the mesh size, we can compare the run time at cycle 4 with fourth degree polynomials with cycle 5 using quadratic polynomials, both at 2.1 million degrees of freedom. The surprising effect is that the solver for \(\mathcal Q_4\) element is actually slightly faster than for the quadratic case, despite using one more linear iteration. The effect that higherdegree polynomials are similarly fast or even faster than lower degree ones is one of the main strengths of matrixfree operator evaluation through sum factorization, see the matrixfree paper. This is fundamentally different to matrixbased methods that get more expensive per unknown as the polynomial degree increases and the coupling gets denser.
In addition, also the setup gets a bit cheaper for higher order, which is because fewer elements need to be set up.
Finally, let us look at the timings with degree 8, which corresponds to another round of mesh refinement in the lower order methods:
Here, the initialization seems considerably slower than before, which is mainly due to the computation of the diagonal of the matrix, which actually computes a 729 x 729 matrix on each cell and throws away everything but the diagonal. The solver times, however, are again very close to the quartic case, showing that the linear increase with the polynomial degree that is theoretically expected is almost completely offset by better computational characteristics and the fact that higher order methods have a smaller share of degrees of freedom living on several cells that add to the evaluation complexity.
In order to understand the capabilities of the matrixfree implementation, we compare the performance of the 3d example above with a sparse matrix implementation based on TrilinosWrappers::SparseMatrix by measuring both the computation times for the initialization of the problem (distribute DoFs, setup and assemble matrices, setup multigrid structures) and the actual solution for the matrixfree variant and the variant based on sparse matrices. We base the preconditioner on float numbers and the actual matrix and vectors on double numbers, as shown above. Tests are run on an Intel Core i75500U notebook processor (two cores and AVX support, i.e., four operations on doubles can be done with one CPU instruction, which is heavily used in FEEvaluation), optimized mode, and two MPI ranks.
Sparse matrix  Matrixfree implementation  

n_dofs  Setup + assemble  Solve  Setup + assemble  Solve 
125  0.0042s  0.0012s  0.0022s  0.00095s 
729  0.012s  0.0040s  0.0027s  0.0021s 
4,913  0.082s  0.012s  0.011s  0.0057s 
35,937  0.73s  0.13s  0.048s  0.040s 
274,625  5.43s  1.01s  0.33s  0.25s 
2,146,689  43.8s  8.24s  2.42s  2.06s 
The table clearly shows that the matrixfree implementation is more than twice as fast for the solver, and more than six times as fast when it comes to initialization costs. As the problem size is made a factor 8 larger, we note that the times usually go up by a factor eight, too (as the solver iterations are constant at six). The main deviation is in the sparse matrix between 5k and 36k degrees of freedom, where the time increases by a factor 12. This is the threshold where the (L3) cache in the processor can no longer hold all data necessary for the matrixvector products and all matrix elements must be fetched from main memory.
Of course, this picture does not necessarily translate to all cases, as there are problems where knowledge of matrix entries enables much better solvers (as happens when the coefficient is varying more strongly than in the above example). Moreover, it also depends on the computer system. The present system has good memory performance, so sparse matrices perform comparably well. Nonetheless, the matrixfree implementation gives a nice speedup already for the Q_{2} elements used in this example. This becomes particularly apparent for timedependent or nonlinear problems where sparse matrices would need to be reassembled over and over again, which becomes much easier with this class. And of course, thanks to the better complexity of the products, the method gains increasingly larger advantages when the order of the elements increases (the matrixfree implementation has costs 4d^{2}p per degree of freedom, compared to 2p^{d} for the sparse matrix, so it will win anyway for order 4 and higher in 3d).
As explained in the introduction and the incode comments, this program can be run in parallel with MPI. It turns out that geometric multigrid schemes work really well and can scale to very large machines. To the authors' knowledge, the geometric multigrid results shown here are the largest computations done with deal.II as of late 2016, run on up to 147,456 cores of the complete SuperMUC Phase 1. The ingredients for scalability beyond 1000 cores are that no data structure that depends on the global problem size is held in its entirety on a single processor and that the communication is not too frequent in order not to run into latency issues of the network. For PDEs solved with iterative solvers, the communication latency is often the limiting factor, rather than the throughput of the network. For the example of the SuperMUC system, the pointtopoint latency between two processors is between 1e6 and 1e5 seconds, depending on the proximity in the MPI network. The matrixvector products with LaplaceOperator
from this class involves several pointtopoint communication steps, interleaved with computations on each core. The resulting latency of a matrixvector product is around 1e4 seconds. Global communication, for example an MPI_Allreduce
operation that accumulates the sum of a single number per rank over all ranks in the MPI network, has a latency of 1e4 seconds. The multigrid Vcycle used in this program is also a form of global communication. Think about the coarse grid solve that happens on a single processor: It accumulates the contributions from all processors before it starts. When completed, the coarse grid solution is transferred to finer levels, where more and more processors help in smoothing until the fine grid. Essentially, this is a treelike pattern over the processors in the network and controlled by the mesh. As opposed to the MPI_Allreduce
operations where the tree in the reduction is optimized to the actual links in the MPI network, the multigrid Vcycle does this according to the partitioning of the mesh. Thus, we cannot expect the same optimality. Furthermore, the multigrid cycle is not simply a walk up and down the refinement tree, but also communication on each level when doing the smoothing. In other words, the global communication in multigrid is more challenging and related to the mesh that provides less optimization opportunities. The measured latency of the Vcycle is between 6e3 and 2e2 seconds, i.e., the same as 60 to 200 MPI_Allreduce operations.
The following figure shows a scaling experiments on \(\mathcal Q_3\) elements. Along the lines, the problem size is held constant as the number of cores is increasing. When doubling the number of cores, one expects a halving of the computational time, indicated by the dotted gray lines. The results show that the implementation shows almost ideal behavior until an absolute time of around 0.1 seconds is reached. The solver tolerances have been set such that the solver performs five iterations. This way of plotting data is the strong scaling of the algorithm. As we go to very large core counts, the curves flatten out a bit earlier, which is because of the communication network in SuperMUC where communication between processors farther away is slightly slower.
In addition, the plot also contains results for weak scaling that lists how the algorithm behaves as both the number of processor cores and elements is increased at the same pace. In this situation, we expect that the compute time remains constant. Algorithmically, the number of CG iterations is constant at 5, so we are good from that end. The lines in the plot are arranged such that the top left point in each data series represents the same size per processor, namely 131,072 elements (or approximately 3.5 million degrees of freedom per core). The gray lines indicating ideal strong scaling are by the same factor of 8 apart. The results show again that the scaling is almost ideal. The parallel efficiency when going from 288 cores to 147,456 cores is at around 75% for a local problem size of 750,000 degrees of freedom per core which takes 1.0s on 288 cores, 1.03s on 2304 cores, 1.19s on 18k cores, and 1.35s on 147k cores. The algorithms also reach a very high utilization of the processor. The largest computation on 147k cores reaches around 1.7 PFLOPs/s on SuperMUC out of an arithmetic peak of 3.2 PFLOPs/s. For an iterative PDE solver, this is a very high number and significantly more is often only reached for dense linear algebra. Sparse linear algebra is limited to a tenth of this value.
As mentioned in the introduction, the matrixfree method reduces the memory consumption of the data structures. Besides the higher performance due to less memory transfer, the algorithms also allow for very large problems to fit into memory. The figure below shows the computational time as we increase the problem size until an upper limit where the computation exhausts memory. We do this for 1k cores, 8k cores, and 65k cores and see that the problem size can be varied over almost two orders of magnitude with ideal scaling. The largest computation shown in this picture involves 292 billion ( \(2.92 \cdot 10^{11}\)) degrees of freedom. On a DG computation of 147k cores, the above algorithms were also run involving up to 549 billion (2^39) DoFs.
Finally, we note that while performing the tests on the largescale system shown above, improvements of the multigrid algorithms in deal.II have been developed. The original version contained the suboptimal code based on MGSmootherPrecondition where some MPI_Allreduce commands (checking whether all vector entries are zero) were done on each smoothing operation on each level, which only became apparent on 65k cores and more. However, the following picture shows that the improvement already pay off on a smaller scale, here shown on computations on up to 14,336 cores for \(\mathcal Q_5\) elements:
As explained in the code, the algorithm presented here is prepared to run on adaptively refined meshes. If only part of the mesh is refined, the multigrid cycle will run with local smoothing and impose Dirichlet conditions along the interfaces which differ in refinement level for smoothing through the MatrixFreeOperators::Base class. Due to the way the degrees of freedom are distributed over levels, relating the owner of the level cells to the owner of the first descendant active cell, there can be an imbalance between different processors in MPI, which limits scalability by a factor of around two to five.
As mentioned above the code is ready for locally adaptive hrefinement. For the Poisson equation one can employ the Kelly error indicator, implemented in the KellyErrorEstimator class. However one needs to be careful with the ghost indices of parallel vectors. In order to evaluate the jump terms in the error indicator, each MPI process needs to know locally relevant DoFs. However MatrixFree::initialize_dof_vector() function initializes the vector only with some locally relevant DoFs. The ghost indices made available in the vector are a tight set of only those indices that are touched in the cell integrals (including constraint resolution). This choice has performance reasons, because sending all locally relevant degrees of freedom would be too expensive compared to the matrixvector product. Consequently the solution vector asis is not suitable for the KellyErrorEstimator class. The trick is to change the ghost part of the partition, for example using a temporary vector and LinearAlgebra::distributed::Vector::copy_locally_owned_data_from() as shown below.
This program is parallelized with MPI only. As an alternative, the MatrixFree loop can also be issued in hybrid mode, for example by using MPI parallelizing over the nodes of a cluster and with threads through Intel TBB within the shared memory region of one node. To use this, one would need to both set the number of threads in the MPI_InitFinalize data structure in the main function, and set the MatrixFree::AdditionalData::tasks_parallel_scheme to partition_color to actually do the loop in parallel. This use case is discussed in step48.
The presented program assumes homogeneous Dirichlet boundary conditions. When going to nonhomogeneous conditions, the situation is a bit more intricate. To understand how to implement such a setting, let us first recall how these arise in the mathematical formulation and how they are implemented in a matrixbased variant. In essence, an inhomogeneous Dirichlet condition sets some of the nodal values in the solution to given values rather than determining them through the variational principles,
\begin{eqnarray*} u_h(\mathbf{x}) = \sum_{i\in \mathcal N} \varphi_i(\mathbf{x}) u_i = \sum_{i\in \mathcal N \setminus \mathcal N_D} \varphi_i(\mathbf{x}) u_i + \sum_{i\in \mathcal N_D} \varphi_i(\mathbf{x}) g_i, \end{eqnarray*}
where \(u_i\) denotes the nodal values of the solution and \(\mathcal N\) denotes the set of all nodes. The set \(\mathcal N_D\subset \mathcal N\) is the subset of the nodes that are subject to Dirichlet boundary conditions where the solution is forced to equal \(u_i = g_i = g(\mathbf{x}_i)\) as the interpolation of boundary values on the Dirichletconstrained node points \(i\in \mathcal N_D\). We then insert this solution representation into the weak form, e.g. the Laplacian shown above, and move the known quantities to the right hand side:
\begin{eqnarray*} (\nabla \varphi_i, \nabla u_h)_\Omega &=& (\varphi_i, f)_\Omega \quad \Rightarrow \\ \sum_{j\in \mathcal N \setminus \mathcal N_D}(\nabla \varphi_i,\nabla \varphi_j)_\Omega \, u_j &=& (\varphi_i, f)_\Omega \sum_{j\in \mathcal N_D} (\nabla \varphi_i,\nabla\varphi_j)_\Omega\, g_j. \end{eqnarray*}
In this formula, the equations are tested for all basis functions \(\varphi_i\) with \(i\in N \setminus \mathcal N_D\) that are not related to the nodes constrained by Dirichlet conditions.
In the implementation in deal.II, the integrals \((\nabla \varphi_i,\nabla \varphi_j)_\Omega\) on the right hand side are already contained in the local matrix contributions we assemble on each cell. When using AffineConstraints::distribute_local_to_global() as first described in the step6 and step7 tutorial programs, we can account for the contribution of inhomogeneous constraints j by multiplying the columns j and rows i of the local matrix according to the integrals \((\varphi_i, \varphi_j)_\Omega\) by the inhomogeneities and subtracting the resulting from the position i in the global righthandside vector, see also the Constraints on degrees of freedom module. In essence, we use some of the integrals that get eliminated from the left hand side of the equation to finalize the right hand side contribution. Similar mathematics are also involved when first writing all entries into a left hand side matrix and then eliminating matrix rows and columns by MatrixTools::apply_boundary_values().
In principle, the components that belong to the constrained degrees of freedom could be eliminated from the linear system because they do not carry any information. In practice, in deal.II we always keep the size of the linear system the same to avoid handling two different numbering systems and avoid confusion about the two different index sets. In order to ensure that the linear system does not get singular when not adding anything to constrained rows, we then add dummy entries to the matrix diagonal that are otherwise unrelated to the real entries.
In a matrixfree method, we need to take a different approach, since the LaplaceOperator
class represents the matrixvector product of a homogeneous operator (the lefthand side of the last formula). It does not matter whether the AffineConstraints object passed to the MatrixFree::reinit() contains inhomogeneous constraints or not, the MatrixFree::cell_loop() call will only resolve the homogeneous part of the constraints as long as it represents a linear operator.
In our matrixfree code, the right hand side computation where the contribution of inhomogeneous conditions ends up is completely decoupled from the matrix operator and handled by a different function above. Thus, we need to explicitly generate the data that enters the right hand side rather than using a byproduct of the matrix assembly. Since we already know how to apply the operator on a vector, we could try to use those facilities for a vector where we only set the Dirichlet values:
or, equivalently, if we already had filled the inhomogeneous constraints into an AffineConstraints object,
We could then pass the vector solution
to the LaplaceOperator::vmult_add()
function and add the new contribution to the system_rhs
vector that gets filled in the LaplaceProblem::assemble_rhs()
function. However, this idea does not work because the FEEvaluation::read_dof_values() call used inside the vmult() functions assumes homogeneous values on all constraints (otherwise the operator would not be a linear operator but an affine one). To also retrieve the values of the inhomogeneities, we could select one of two following strategies.
The class FEEvaluation has a facility that addresses precisely this requirement: For nonhomogeneous Dirichlet values, we do want to skip the implicit imposition of homogeneous (Dirichlet) constraints upon reading the data from the vector solution
. For example, we could extend the LaplaceProblem::assemble_rhs()
function to deal with inhomogeneous Dirichlet values as follows, assuming the Dirichlet values have been interpolated into the object constraints:
In this code, we replaced the FEEvaluation::read_dof_values() function for the tentative solution vector by FEEvaluation::read_dof_values_plain() that ignores all constraints. Due to this setup, we must make sure that other constraints, e.g. by hanging nodes, are correctly distributed to the input vector already as they are not resolved as in FEEvaluation::read_dof_values_plain(). Inside the loop, we then evaluate the Laplacian and repeat the second derivative call with FEEvaluation::submit_gradient() from the LaplaceOperator
class, but with the sign switched since we wanted to subtract the contribution of Dirichlet conditions on the right hand side vector according to the formula above. When we invoke the FEEvaluation::integrate() call, we then set both arguments regarding the value slot and first derivative slot to true to account for both terms added in the loop over quadrature points. Once the right hand side is assembled, we then go on to solving the linear system for the homogeneous problem, say involving a variable solution_update
. After solving, we can add solution_update
to the solution
vector that contains the final (inhomogeneous) solution.
Note that the negative sign for the Laplacian alongside with a positive sign for the forcing that we needed to build the right hand side is a more general concept: We have implemented nothing else than Newton's method for nonlinear equations, but applied to a linear system. We have used an initial guess for the variable solution
in terms of the Dirichlet boundary conditions and computed a residual \(r = f  Au_0\). The linear system was then solved as \(\Delta u = A^{1} (fAu)\) and we finally computed \(u = u_0 + \Delta u\). For a linear system, we obviously reach the exact solution after a single iteration. If we wanted to extend the code to a nonlinear problem, we would rename the assemble_rhs()
function into a more descriptive name like assemble_residual()
that computes the (weak) form of the residual, whereas the LaplaceOperator::apply_add()
function would get the linearization of the residual with respect to the solution variable.
A second alternative to get the right hand side that reuses the LaplaceOperator::apply_add()
function is to instead add a second LaplaceOperator that skips Dirichlet constraints. To do this, we initialize a second MatrixFree object which does not have any boundary value constraints. This matrix_free
object is then passed to a LaplaceOperator
class instance inhomogeneous_operator
that is only used to create the right hand side:
A more sophisticated implementation of this technique could reuse the original MatrixFree object. This can be done by initializing the MatrixFree object with multiple blocks, where each block corresponds to a different AffineConstraints object. Doing this would require making substantial modifications to the LaplaceOperator class, but the MatrixFreeOperators::LaplaceOperator class that comes with the library can do this. See the discussion on blocks in MatrixFreeOperators::Base for more information on how to set up blocks.
While the performance achieved in this tutorial program is already very good, there is functionality in deal.II to further improve the performance. On the one hand, increasing the polynomial degree to three or four will further improve the time per unknown. (Even higher degrees typically get slower again, because the multigrid iteration counts increase slightly with the chosen simple smoother. One could then use hybrid multigrid algorithms to use polynomial coarsening through MGTransferGlobalCoarsening, to reduce the impact of the coarser level on the communication latency.) A more significant improvement can be obtained by datalocality optimizations. The class PreconditionChebyshev, when combined with a DiagonalMatrix
inner preconditioner as in the present class, can overlap the vector operations with the matrixvector product. As the former are typically constrained by memory bandwidth, reducing the number of loads helps to achieve this goal. The two ingredients to achieve this are
vmult
function that takes two std::function
objects, which can be passed on to MatrixFree::cell_loop with the respective signature (PreconditionChebyshev will then pick up this interface and schedule its vector operations), and